How do you guys store tick data?

johnnyqpublic · Jun 10, 2010

Quote from const451:

CSV? And if you want to query some part of your data, say one year worth of tick data, what do you do?

If you're storing your data in smaller than one-year segments, read as many in as it takes to get a year. If they're > 1 year, you've got some awfully big files there.

GTG · Jun 10, 2010

I store tick events in binary files, separated out by date. Each security gets its own directory, so for example all of my "SPY" tick data is in a directory called "STK_SPY". I load the files a few at a time on another thread as a back-test progresses. 99% of the time I am interested in looking at the data as a sequence of days, i.e. what happened between dates x and y, so these types of queries are very easy to implement with this format.
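A minimal sketch of that kind of date-range lookup, assuming one file per day named YYYY-MM-DD.bin inside the "STK_<symbol>" directory. The file naming and the `files_for_range` helper are hypothetical, not GTG's actual code:

```python
from datetime import date
from pathlib import Path


def files_for_range(root, symbol, start, end):
    """Return the per-day files for `symbol` dated within [start, end], oldest first."""
    folder = Path(root) / f"STK_{symbol}"
    selected = []
    for path in folder.glob("*.bin"):
        try:
            # Assumption: the file name (without extension) is an ISO date.
            day = date.fromisoformat(path.stem)
        except ValueError:
            continue  # skip files that do not follow the naming assumption
        if start <= day <= end:
            selected.append((day, path))
    return [path for day, path in sorted(selected)]
```

A back-test would then walk the returned list in order and load each file as it is needed, for example `files_for_range("data", "SPY", date(2010, 1, 4), date(2010, 6, 10))`.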

januson · Jun 10, 2010

Quote from uexkuell:

Plain binary files.
Can't beat them. With today's fast computers, any superimposed database layer will create unnecessary overhead (speed and code).

Datafields:
Byte key (trade, dom bid/ask event, volume, others)
Single price
Long volume
Long timestamp (millisec from 00:00 same day)

One file per symbol per day.
Very simple to access, search, analyze.

Couldn't agree more.
Though if speed weren't the issue, I would prefer a DB like SQL Server 2008.
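For illustration, the record layout in the quote (byte key, single price, long volume, long timestamp) could be sketched with Python's `struct` module. The byte order, the lack of padding, and the field widths (4-byte float, 8-byte longs) are assumptions, as are the helper names:

```python
import struct

# Assumed layout: little-endian, no padding.
#   B = 1-byte event key, f = 4-byte float price,
#   q = 8-byte volume,    q = 8-byte milliseconds since 00:00 the same day
TICK = struct.Struct("<Bfqq")


def pack_tick(key, price, volume, ms_since_midnight):
    """Encode one tick event as a fixed-size binary record."""
    return TICK.pack(key, price, volume, ms_since_midnight)


def unpack_tick(record):
    """Decode one fixed-size record back into (key, price, volume, ms)."""
    return TICK.unpack(record)
```

Under these assumptions every record is `TICK.size` (21) bytes, so record number n in a file starts at byte offset n * 21.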

const451 · Jun 10, 2010

CSV and binary files look to me like they require some extra programming effort to achieve the flexibility of a database. Why not use SQL Server or MySQL to store data for backtesting purposes? Backtesting does not require real-time performance, so one of those relational databases would probably suffice. I've never used KDB, but it's used for storing data in real time, and that functionality is not needed for backtesting.

promagma · Jun 10, 2010

Quote from mizhael:

Indexing and querying a small chunk from one day's worth of tick data is still a hassle and slow, am I right?

Slower than KDB... I think.

HDF5 and the file system both give you the ability to group your data (grouped by date and subgrouped by ticker, like others have said) and store it pre-ordered by time. So if you only have one type of query, you can structure your data for that and completely avoid table scans or needing to sort anything. It is basically a straight shot from the disk.

I have no experience with KDB but I would guess it is more flexible if you need to query data in many different ways.

Any of this beats relational databases, which have no concept of pre-ordered data, so you can't even do a quick binary search lookup without a hefty index file.
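As a rough sketch of the binary search lookup mentioned above: with fixed-size records stored in time order, a timestamp can be located by seeking directly into the file, with no separate index. The 21-byte record layout and the function name are assumptions for illustration:

```python
import io
import struct

# Assumed record: 1-byte key, 4-byte float price, 8-byte volume,
# 8-byte millisecond timestamp (last field), little-endian, no padding.
TICK = struct.Struct("<Bfqq")


def first_index_at_or_after(f, target_ms):
    """Index of the first record with timestamp >= target_ms.

    `f` is a seekable binary file whose records are sorted by timestamp.
    Returns the record count if every timestamp is smaller than target_ms.
    """
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell() // TICK.size
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * TICK.size)  # jump straight to record `mid`
        timestamp = TICK.unpack(f.read(TICK.size))[3]
        if timestamp < target_ms:
            lo = mid + 1
        else:
            hi = mid
    return lo
```

Each lookup reads on the order of log2(n) records, and reading sequentially from the returned index gives the rest of the requested window.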

mizhael · Jun 10, 2010

Quote from januson:

Couldn't agree more.
Though if speed weren't the issue, I would prefer a DB like SQL Server 2008.

Of course speed is a huge issue.

The goals are:

1. Store a gigantic amount of data
2. Fast queries

mizhael · Jun 10, 2010

Quote from const451:

CSV and binary files look to me like they require some extra programming effort to achieve the flexibility of a database. Why not use SQL Server or MySQL to store data for backtesting purposes? Backtesting does not require real-time performance, so one of those relational databases would probably suffice. I've never used KDB, but it's used for storing data in real time, and that functionality is not needed for backtesting.

But a backtest still needs to be fast because you do "optimization" during the backtest...

Bob111 · Jun 10, 2010

Quote from const451:

CSV and binary files look to me like they require some extra programming effort to achieve the flexibility of a database. Why not use SQL Server or MySQL to store data for backtesting purposes? Backtesting does not require real-time performance, so one of those relational databases would probably suffice. I've never used KDB, but it's used for storing data in real time, and that functionality is not needed for backtesting.

They load differently. Binary file: you just open it and load it into memory. SQL: to load, you have to go through each record. The difference in time and performance is huge. I remember in the old days, calculations that took an hour with binary files could take 5-6 hours when I was using tick data stored in MS Access.
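A minimal sketch of the "open it and load into memory" path, assuming one file of fixed-size 21-byte records per day. The layout and the `load_day` name are hypothetical, and the sketch makes no claim about how much faster this is than a database:

```python
import struct
from pathlib import Path

# Assumed record: 1-byte key, 4-byte float price, 8-byte volume,
# 8-byte millisecond timestamp, little-endian, no padding.
TICK = struct.Struct("<Bfqq")


def load_day(path):
    """Read a whole day file with a single read and decode every record."""
    raw = Path(path).read_bytes()  # one read for the entire file
    usable = len(raw) - len(raw) % TICK.size  # ignore a trailing partial record
    return list(TICK.iter_unpack(raw[:usable]))
```

The result is a list of (key, price, volume, timestamp) tuples that stays in memory for as many back-test passes as needed.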

johnnyqpublic · Jun 10, 2010

Quote from mizhael:

But a backtest still needs to be fast because you do "optimization" during the backtest...

I generally backtest in R. I can load my data once, then backtest in a variety of different ways (or using different backtest variations, as your comment suggests) without having to read the data again. If I change my code, I just re-source it, but the structure containing my data stays present.

mizhael · Jun 10, 2010

Quote from johnnyqpublic:

I generally backtest in R. I can load my data once, then backtest in a variety of different ways (or using different backtest variations, as your comment suggests) without having to read the data again. If I change my code, I just re-source it, but the structure containing my data stays present.

Let's take 6 months of intraday tick data (only last trade, not bid and offer, and not market depth data),

and let's take QQQQ,

can you store the whole data set in memory in R?