How do you guys store tick data?

mizhael · Jun 10, 2010

Quote from Bob111:

They load differently. With a binary file, you just open it and load it into memory. With SQL, you have to go through each record to load it. The difference in time and performance is huge. I remember from the old days that calculations which took an hour with binary files could take 5-6 hours when I was using tick data stored in MS Access.

Yep, but you cannot query binary files quickly, right?

What if you need mean/max/min prices aggregated over 5-millisecond intervals?

mizhael · Jun 10, 2010

Quote from mizhael:

Yep, but you cannot query binary files quickly, right?

What if you need mean/max/min prices aggregated over 5-millisecond intervals?

Not to mention aligning multiple series at the millisecond level...

johnnyqpublic · Jun 10, 2010

Quote from mizhael:

Let's take 6 months of intraday tick data (last trade only, no bid and offer, and no market depth data),

and let's take QQQQ.

Can you store all of that data in memory in R?

I'm showing around 300 MB for a month of the tick data as you described (I specifically checked QQQQ). Let's say that six months, then, is around 2.5-3 gigs (I'm padding the estimate a bit). Then yes, R can store it quite easily, assuming you have the RAM/swap available. I've got 8 GB on this machine, so it isn't a problem.
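As a rough check on that estimate, the arithmetic can be sketched in a few lines of Python. The 300 MB per month and the 8 GB of RAM come from the post above; the 1.5x padding factor is only an assumption for illustration, picked so the result lands inside the quoted 2.5-3 GB range.

```python
# Back-of-the-envelope estimate of RAM needed to hold the tick data.
MB_PER_MONTH = 300   # figure quoted above for QQQQ last-trade ticks
MONTHS = 6
PADDING = 1.5        # hypothetical safety factor, not from the thread
RAM_GB = 8           # machine mentioned above

raw_gb = MB_PER_MONTH * MONTHS / 1000   # unpadded size in GB
padded_gb = raw_gb * PADDING            # padded size in GB
fits = padded_gb < RAM_GB

print(f"raw: {raw_gb:.1f} GB, padded: {padded_gb:.1f} GB, fits in RAM: {fits}")
```

Unpadded, six months comes to about 1.8 GB, so the 2.5-3 GB figure leaves a fair amount of headroom for in-memory overhead.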

promagma · Jun 10, 2010

Quote from mizhael:

Yep, but you cannot query binary files quickly, right?

What if you need mean/max/min prices aggregated over 5-millisecond intervals?

You won't be able to use SQL, but do you really want to rely on a database engine to aggregate data? Stick with reading in raw tick data and aggregate in your C++ code. Not hard and there is no faster way to do it.
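To make the idea concrete, here is a minimal sketch of aggregating raw binary ticks into 5-millisecond buckets. It is written in Python rather than C++ to keep it short, and the record layout (an int64 timestamp in microseconds followed by a float64 price) is an assumption for illustration, not a format anyone in this thread has described.

```python
import struct

# Hypothetical fixed-width record: int64 timestamp (microseconds), float64 price.
RECORD = struct.Struct("<qd")
BUCKET_US = 5_000  # 5 milliseconds expressed in microseconds

def aggregate(raw: bytes, bucket_us: int = BUCKET_US):
    """Return {bucket_start_us: (mean, max, min)} for the packed records."""
    stats = {}  # bucket_start -> [sum, count, max, min]
    for ts, price in RECORD.iter_unpack(raw):
        start = ts - ts % bucket_us          # floor to the bucket boundary
        s = stats.get(start)
        if s is None:
            stats[start] = [price, 1, price, price]
        else:
            s[0] += price
            s[1] += 1
            s[2] = max(s[2], price)
            s[3] = min(s[3], price)
    return {k: (v[0] / v[1], v[2], v[3]) for k, v in stats.items()}

# In practice `raw` would come from open(path, "rb").read(); here it is built in memory.
ticks = [(1_000, 46.0), (3_000, 46.5), (4_999, 46.25), (5_000, 46.75), (12_000, 46.5)]
raw = b"".join(RECORD.pack(ts, price) for ts, price in ticks)
bars = aggregate(raw)
print(bars)
```

If the input is already time-sorted, the buckets come out in time order, since a Python dict preserves insertion order.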

promagma · Jun 10, 2010

Quote from mizhael:

Not to mention aligning multiple series at the millisecond level...

Your data should be pre-ordered, as in a binary file or HDF5. Then aligning the data is easy. Again you can probably do this faster with your own multi-threaded code rather than relying on a database engine.

EDIT: I just saw that you can get your hands on KDB, which is pretty awesome. I'm guessing KDB is built to do exactly this kind of stuff, so I don't want to imply that you can beat that.
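For what it's worth, aligning pre-ordered series can be sketched as a single merge pass. This is a single-threaded Python illustration, not the multi-threaded approach suggested above, and the second symbol ("SPY") and all the prices are made up for the example.

```python
import heapq

def align(series: dict):
    """Merge time-sorted series into rows of (ts, {name: last_price}).

    `series` maps a name to a list of (timestamp, price) tuples that are
    already sorted by timestamp. A series with no tick yet shows as None.
    """
    tagged = (
        [(ts, name, price) for ts, price in ticks]
        for name, ticks in series.items()
    )
    last = {name: None for name in series}
    rows = []
    # heapq.merge streams the sorted inputs without re-sorting them.
    for ts, name, price in heapq.merge(*tagged):
        last[name] = price
        if rows and rows[-1][0] == ts:
            rows[-1] = (ts, dict(last))   # same timestamp: update that row
        else:
            rows.append((ts, dict(last)))
    return rows

# Timestamps in milliseconds; symbols and prices are illustrative only.
series = {
    "QQQQ": [(1, 46.0), (4, 46.5)],
    "SPY": [(2, 109.0), (4, 109.5)],
}
rows = align(series)
for row in rows:
    print(row)
```

Because every input is already sorted, the merge is one pass over the data and nothing ever needs a full sort.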

mizhael · Jun 10, 2010

Quote from johnnyqpublic:

I'm showing around 300 MB for a month of the tick data as you described (I specifically checked QQQQ). Let's say that six months, then, is around 2.5-3 gigs (I'm padding the estimate a bit). Then yes, R can store it quite easily, assuming you have the RAM/swap available. I've got 8 GB on this machine, so it isn't a problem.

Okay, so you are saying R has no memory limit (it is bounded only by hardware RAM), while Matlab does have a memory limit?

mizhael · Jun 10, 2010

Quote from promagma:

Your data should be pre-ordered, as in a binary file or HDF5. Then aligning the data is easy. Again you can probably do this faster with your own multi-threaded code rather than relying on a database engine.

EDIT: I just saw that you can get your hands on KDB, which is pretty awesome. I'm guessing KDB is built to do exactly this kind of stuff, so I don't want to imply that you can beat that.

Yeah, I used to write low-level C++ code to process tick timestamps, etc. Debugging is painful. It's also not too fast: sorting timestamps on 1 GB of data takes quite a while. [For sorting alone, structured software such as SAS is much faster, which made me think raw C++ data processing is not the most cost-efficient route.]

So I came up with the idea of using KDB. Does anybody have experience with KDB vs. C++ for tick data processing?

johnnyqpublic · Jun 10, 2010

Quote from mizhael:

Okay, so you are saying R has no memory limit (it is bounded only by hardware RAM), while Matlab does have a memory limit?

I can't speak for all platforms, but R on 64-bit Linux will only hit address space limitations imposed by the system architecture, as far as I know.

promagma · Jun 10, 2010

Quote from mizhael:

Yeah, I used to write low-level C++ code to process tick timestamps, etc. Debugging is painful. It's also not too fast: sorting timestamps on 1 GB of data takes quite a while.

What I'm trying to say is that if you (or the database engine) are sorting data, you are already in big trouble. I have 100 GB of data, so I've been down that road!

The recommendation for a binary file or HDF5 is because you can store the data already sorted. Then reading in your stream of data is just a straight shot from the disk. The only bottleneck is I/O, and if your data is compressed, that can be minimized.
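A minimal sketch of that idea in Python: sort once when writing, store the records compressed, and read them back with purely sequential I/O. The record layout, the chunk size, and the choice of gzip are all assumptions for illustration. HDF5 or a different codec would follow the same pattern.

```python
import gzip
import os
import struct
import tempfile

RECORD = struct.Struct("<qd")   # hypothetical layout: int64 timestamp, float64 price
CHUNK_RECORDS = 4096            # records per read; an arbitrary choice

def write_ticks(path, ticks):
    """Sort once at write time, then store the records compressed."""
    with gzip.open(path, "wb") as f:
        for ts, price in sorted(ticks):
            f.write(RECORD.pack(ts, price))

def read_ticks(path):
    """Yield (timestamp, price) in stored order using sequential reads only."""
    with gzip.open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size * CHUNK_RECORDS)
            if not chunk:
                break
            if len(chunk) % RECORD.size:
                raise ValueError("file ends with a partial record")
            yield from RECORD.iter_unpack(chunk)

# Illustrative data, deliberately out of order before writing.
ticks = [(3, 46.5), (1, 46.0), (2, 46.25)]
path = os.path.join(tempfile.mkdtemp(), "qqqq.bin.gz")
write_ticks(path, ticks)
loaded = list(read_ticks(path))
print(loaded)
```

The reader never sorts or seeks, so the cost per run is just decompression plus a straight read from disk.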

dinn13 · Jun 11, 2010

KDB+ is a decent way to go. I never liked using Q, though, and when I used it before, the KDB database was shared and people would always get on my case when I ran heavy queries on it. So in the end I just used it as a backup for tick data.

Nowadays, when I want to access tick data directly, I use FastTick (now a Reuters product, http://quant.thomsonreuters.com/), which has its own proprietary database for accessing TAQ data. It's really fast.

Otherwise I'll create CSV files or binary files (stored as serialized Java objects) that aggregate the data in any number of different ways and then run my backtesting apps over those. I have found this to be the fastest way when optimized.
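The CSV variant of that pre-aggregation step might look roughly like the sketch below. It is in Python rather than Java, and the column names and sample bars are invented for the example. An in-memory buffer stands in for the file a backtest would actually read.

```python
import csv
import io

def write_bars(f, bars):
    """Write (bucket_start, mean, high, low) rows under a header line."""
    w = csv.writer(f)
    w.writerow(["bucket_start", "mean", "high", "low"])
    w.writerows(bars)

def read_bars(f):
    """Read the rows back as typed tuples, skipping the header."""
    r = csv.reader(f)
    next(r)
    return [(int(t), float(m), float(h), float(l)) for t, m, h, l in r]

# Illustrative pre-aggregated bars; a real run would compute these from ticks.
bars = [(0, 46.25, 46.5, 46.0), (5000, 46.75, 46.75, 46.75)]
buf = io.StringIO()
write_bars(buf, bars)
buf.seek(0)
cached = read_bars(buf)
print(cached)
```

The backtest then only ever touches the small aggregated file, not the raw ticks.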