How do you guys store tick data?

mizhael · Jun 14, 2010

Quote from PocketChange:

SQLite works well... for Excel users checkout dhRichClient from thecommon.net. Allows Excel users to easily integrate Sqlite databases in worksheets without the ms overhead and even utilize SQLite as an application file format. In-memory SQLite databases are blazing fast.

SQLite vs. KDB?

januson · Jun 14, 2010

Quote from mizhael:

Well I would imagine if the DB have the smart indexing and caching, etc.

it will be faster than plain binary on an average aggregated basis.

No they will not, binary is the lowest you can go. A DB has to read the whole file stream into memory/ disk and aggregate at that, it will be exactly the same as if you created your own aggregation at a stream. But the difference is that your own solution is specialized and the DB isn't which means overhead caused by features you eventually don't need.
As a 2nd reason and maybe the best, when storing tick data in files (another poster suggested the same) you can create the files based by a time scale, hours, days etc., you have to decide that when you write your specification.
Writing your own solution WILL always perform better/ faster, but maybe not as versatile and that is exactly what your decision should be based upon

WinstonTJ · Jun 15, 2010

Quote from uexkuell:

Plain binary files.
Can't beat them. With todays fast computers any superimposed database layer will create unnecessary overhead (speed and code).

Datafields:
Byte key (trade, dom bid/ask event, volume, others)
Single price
Long volume
Long timestamp (millisec from 00:00 same day)

One file per symbol per day.
Very simple to access, search, analyze.

^^^^ This. I'm building out three years of tik data for the Russell 3000 in a modified Backblaze (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/).

All Binary, all accessed via a custom application & backtest engine.

TrailingStop3 · Jun 15, 2010

I use single files to store tickdata. Form the name of the symbol I create a folder which contains all the files for that symbol. For each tradingday I create a new file in this folder. The date is the filename, so it is very easy and fast to access it. The data is z-lib compressed to reduce the time to read the quotes from the file since decompression in memory is faster that reading from a disk. I store trades/bid/ask, each with price, volume and timestamp with milisec resolution. From this base I can create all needed timeframes for charting and you can also do backtesting with âlook inside barsâ based on bid/ask rather than tradesâ¦.

Daniel

januson · Jun 15, 2010

Quote from WinstonTJ:

^^^^ This. I'm building out three years of tik data for the Russell 3000 in a modified Backblaze (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/).

All Binary, all accessed via a custom application & backtest engine.

Thank you very much for this hint

mizhael · Jun 18, 2010

Quote from WinstonTJ:

^^^^ This. I'm building out three years of tik data for the Russell 3000 in a modified Backblaze (http://blog.backblaze.com/2009/09/01/petabytes-on-a-budget-how-to-build-cheap-cloud-storage/).

All Binary, all accessed via a custom application & backtest engine.

wow... admiring...

why not KDB?

Rodney King · Jun 18, 2010

Quote from mizhael:

why not KDB?

Ummm, because it costs a lot of money?

test_user · Jun 19, 2010

I store it as binary data. Here's how:

one file per day

HEADER:

first 3 bytes are date (year+1900, month and day)

then 3 bytes (unsigned) is the time of the first tick in the day (time is measured in seconds since 00:00)

then 4 bytes (unsigned) for the price of the first tick
and 2 bytes (unsigned) for volume of the tick

THE REST:

For every tick {

3 bytes (unsigned) for a change of time since last tick in seconds. Usually 0 or +1

2 bytes (signed) for a change of price since last tick in ticks, which is usually 0, +1 or -1 (that would represent a price change of 0, +0.25 and -0.25 for ES)

2 bytes (unsigned) for volume of the tick

}

The byte array is then zipped (compressed). Compression is very effective since most of the bytes are either 0, 1 or -1.

Storing data in non-binary format (like CSV) is very inefficient space-wise and performance-wise. All textual data have to be parsed. I experimented with binary vs. text-based data and found that parsing text data is about 8 times slower than reading in binary data. And the data size differences are very obvious. Binary+compressed data takes up about 20 times less hard disk space than CSV data.

I haven't experimented with databases. But in my case using a DB would be a huge and unnecessary overhead.

Engine99 · Jun 19, 2010

I don't store tickdata but if I would I'd store it in SQL.
I store 800 million quotes in mysql (daily, weekly and 15 second quotes) and run various tests against it.

A relational database can never be faster than a csv or binary format when it comes to raw data processing. However the way my backtesting algorithms work is they take advantage of some of the convenience of a relational database, I'd hate to duplicate that myself by processing a csv/binary file.

Most of my actual backtesting is done from in memory, so there's not much difference once the structure is loaded. It will take longer to load the content from a mysql database, but then again, I can easily compare gaps with gaps on the S&P 500 without having to come up with that myself, I can just use standard (ansi) sql. I guess it depends on what your trading timeframe is, whether or not you want to correlate your strategy to index performance to optimize it ....

For my type of trading, I'd gain very little processing the data binary/csv.

Stoxtrader · Jun 19, 2010

Quote from mizhael:

Your backtest system must be coupled with a nice/fast/compressed database...

Ideally, the backtest program (say C++) query the database and you run backtest on small trunks a time...

So how do you do it?

What database do you use?

Thank you!

( I am thinking KDB together with Matlab)...

PostgreSQL for persistent store but I frequently cache data in memory, just as arrays but you could use b-trees, tries, hash map, etc. depending on your needs.

Note, even with PostgreSQL which I've had zero problems with, I make backups.