How do you guys store tick data?

ET151 · Jul 10, 2010

Quote from Stoxtrader:

Make sure to test any of these new NoSQL databases like Tokyo Cabinet before using them on an actual system that trades real money. For example there are reports of MongoDB dropping/deleting/corrupting data. For example if you disable any logging, consistency checks, etc from PostgreSQL it's as fast as any NoSQL database... at the expense of no logging and consistency checks. "Yay it's blazing fast!" "Hey uh where did my last 6 months of data go?" If you need blazing fast I would recommend using an in-memory database and/or solid state drives.

I have not heard anything negative about Tokyo Cabinet, I'm just sayin'.

I have not tested it out yet, but Tokyo Cabinet claims to store 1 million records in under a second (hash table mode) or 1.6 seconds for B-Tree mode. Are you sure PostgreSQL can keep up with that even without consistency checks? A million records stored in 1.6 seconds is pretty damn fast.

http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/

It doesn't look too hard to set up, so I hope to test it out soon.

Update: I checked, even SQLite claims to be much faster than PostgreSQL. (http://www.sqlite.org/speed.html)
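
Since the numbers above are vendor claims, here is a minimal sketch for timing a bulk insert yourself. It uses Python's standard sqlite3 module against an in-memory database; the table name `kv`, the row contents, and the function name `time_bulk_insert` are made up for illustration, and the result says nothing about Tokyo Cabinet or PostgreSQL on your hardware.

```python
import sqlite3
import time


def time_bulk_insert(n_records: int):
    """Insert n_records rows into an in-memory SQLite table in one
    transaction. Returns (elapsed_seconds, rows_stored)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE kv (k INTEGER PRIMARY KEY, v TEXT)")
    rows = ((i, f"value-{i}") for i in range(n_records))

    start = time.perf_counter()
    with conn:  # commits once, after the whole batch
        conn.executemany("INSERT INTO kv (k, v) VALUES (?, ?)", rows)
    elapsed = time.perf_counter() - start

    count = conn.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
    conn.close()
    return elapsed, count


if __name__ == "__main__":
    n = 1_000_000
    secs, stored = time_bulk_insert(n)
    print(f"{stored} rows in {secs:.2f} s")
```

Wrapping the batch in a single transaction matters a lot here; committing per row would measure something very different.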

nbates · Jul 10, 2010

solid state drives & flat files...a database in the path of execution or anywhere near market data is a pox on quality

ET151 · Jul 10, 2010

nbates, I've asked this question in another thread...if you go with flat files, are you able to access data in that file while the file is being written? Benefit of database is that you can read and write to it concurrently (single writer, multiple reader).

ET151 · Jul 10, 2010

Umm...just found something better than Tokyo Cabinet...

http://1978th.net/kyotocabinet/

Features

Now that all of my plans have been achieved, Kyoto Cabinet has the following features. Especially, Windows support is remarkable.

* time efficiency: Throughput of updating is more than 100 millions query-per-second.

* space efficiency: Footprint for each record is 8-16 bytes in the hash DB, 2-4 bytes in the tree DB.

* concurrency: The hash DB uses read-write lock for each record. The tree DB uses read-write lock for each page.

* usability: Generic operations of database by interface like the "Visitor" pattern are provided.

* robustness: Manual transaction, auto transaction, and auto recovery are provided.

* portability: UNIX-like systems (Linux, FreeBSD, Solaris, Mac OS X) and Windows (VC++) are supported.

* language bindings: C++, C, Java, Python, Ruby, and Perl are supported.

Compared with Tokyo Cabinet, KC is superior in concurrency, usability, and portability. Although time efficiency for single-thread is better in TC, I recommend KC from now on because multi-core/many-core CPU has been popular. However, I will keep on maintaining TC and fix bugs if they are found.

(http://1978th.net/tech-en/promenade.cgi?id=7)

nbates · Jul 10, 2010

It really does not matter whether or not you have asked the question in another thread; I answered it in this one. A database is "nothing more" than a flat file for those who do not know and/or understand how to use threads and critical sections to implement concurrent store-and-fetch trade history and time-series storage systems.

Database is one-size fits ALL, don't try running the 100 yard dash expecting to win in those shoes at the Olympics, lol

ET151 · Jul 10, 2010

Quote from nbates:

It really does not matter whether or not you have asked the question in another thread; I answered it in this one. A database is "nothing more" than a flat file for those who do not know and/or understand how to use threads and critical sections to implement concurrent store-and-fetch trade history and time-series storage systems.

Database is one-size fits ALL, don't try running the 100 yard dash expecting to win in those shoes at the Olympics, lol

Yes, but as I asked previously, are you actually able to read all the information that you have written to your flat file while it is open and you are writing to it without having cached all the contents of the file in memory? Question here is whether it is worth developing such code or simply use a very powerful database solution that I can have running in less than a day. I can always write to the database during the week and then dump to a file when market closes. Not claiming that this is the ultimate, final approach, but I like leveraging other people's work as much as possible.

nbates · Jul 10, 2010

Good question!

The approach I've found best is to cache a certain amount of data in memory and periodically, based on elapsed time or the amount cached, push chunks out to disk by appending to a file. When a request comes from the client application and needs more than what's currently in the cache, read the file and append whatever is currently in the memory cache, if there is any, to the stored data that was fetched.

I use a critical section around functions like "add_point" and "get_series" which are each driven by a different thread and 99.9% of the time there's never contention and when there is it's undetectable.
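
A rough sketch of that cache-plus-append-file idea, in Python for brevity. The names `add_point` and `get_series` come from the post; everything else (`SeriesStore`, the two-float record layout, `flush_every`) is a hypothetical stand-in, and `threading.Lock` plays the role of the critical section. This is not nbates's actual code.

```python
import os
import struct
import tempfile
import threading

RECORD = struct.Struct("<dd")  # (timestamp, price); an assumed layout


class SeriesStore:
    """In-memory tail plus an append-only file, guarded by one lock."""

    def __init__(self, path, flush_every=1000):
        self._path = path
        self._flush_every = flush_every
        self._cache = []               # points not yet written to disk
        self._lock = threading.Lock()  # the "critical section"
        open(path, "ab").close()       # make sure the file exists

    def add_point(self, ts, price):
        with self._lock:
            self._cache.append((ts, price))
            if len(self._cache) >= self._flush_every:
                self._flush_locked()

    def _flush_locked(self):
        # Caller holds the lock: append the cached chunk, then clear it.
        with open(self._path, "ab") as f:
            f.write(b"".join(RECORD.pack(*p) for p in self._cache))
        self._cache.clear()

    def get_series(self, last_n):
        with self._lock:
            if last_n <= 0:
                return []
            if last_n <= len(self._cache):
                return self._cache[-last_n:]
            # More than the cache holds: fetch the missing records from
            # the end of the file, then append the cache to them.
            need = last_n - len(self._cache)
            with open(self._path, "rb") as f:
                size = f.seek(0, os.SEEK_END)
                f.seek(max(0, size - need * RECORD.size))
                data = f.read()
            stored = [RECORD.unpack_from(data, off)
                      for off in range(0, len(data), RECORD.size)]
            return stored + self._cache


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "series.bin")
    store = SeriesStore(path, flush_every=3)
    for i in range(5):
        store.add_point(float(i), 100.0 + i)
    print(store.get_series(4))
```

Here the flush is triggered by count only; a time-based flush would need a timer thread taking the same lock.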

The thing is, with a database you can only do some number of transactions per second (pick a number) and if you are storing each bar on 40,000 stocks at whatever interval [I do it on a one-second interval] then you are limited by the database.

Instead, store bars in memory...compress them using something like a "rep_count" which means if the last bar equals the next bar to store, then set "rep_count=2" for example and you can do a hell of a lot more that fits with what you're trying to accomplish, if performance is the goal or an objective!
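
The "rep_count" idea is essentially run-length encoding of identical consecutive bars. A minimal sketch, assuming a bar is just a tuple such as (open, high, low, close, volume); the `Run` class and the helper names are illustrative, only `rep_count` is from the post.

```python
from dataclasses import dataclass


@dataclass
class Run:
    bar: tuple          # assumed layout: (open, high, low, close, volume)
    rep_count: int = 1  # how many consecutive times this bar occurred


def append_bar(runs, bar):
    """Store a bar, bumping rep_count if it equals the last stored bar."""
    if runs and runs[-1].bar == bar:
        runs[-1].rep_count += 1
    else:
        runs.append(Run(bar))


def expand(runs):
    """Rebuild the full bar sequence from the compressed runs."""
    out = []
    for run in runs:
        out.extend([run.bar] * run.rep_count)
    return out
```

For an illiquid stock sampled every second, most bars repeat the previous one, so the list of runs stays far shorter than the raw series.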

Stoxtrader · Jul 10, 2010

Quote from ET151:

I have not tested it out yet, but Tokyo Cabinet claims to store 1 million records in under a second (hash table mode) or 1.6 seconds for B-Tree mode. Are you sure PostgreSQL can keep up with that even without consistency checks? A million records stored in 1.6 seconds is pretty damn fast.

http://www.igvita.com/2009/02/13/tokyo-cabinet-beyond-key-value-store/

It doesn't look too hard to set up, so I hope to test it out soon.

Update: I checked, even SQLite claims to be much faster than PostgreSQL. (http://www.sqlite.org/speed.html)

One config change in PostgreSQL makes the speed similar to CouchDB, Tokyo Tyrant, Redis, MongoDB, Cassandra and Project Voldemort, and PostgreSQL can be tuned to make it even faster:
http://www.pgcon.org/2010/schedule/attachments/141_PostgreSQL-and-NoSQL.pdf

ET151 · Jul 10, 2010

Quote from nbates:

Good question!

The approach I've found best is to cache a certain amount of data in memory and periodically, based on elapsed time or the amount cached, push chunks out to disk by appending to a file. When a request comes from the client application and needs more than what's currently in the cache, read the file and append whatever is currently in the memory cache, if there is any, to the stored data that was fetched.

I use a critical section around functions like "add_point" and "get_series" which are each driven by a different thread and 99.9% of the time there's never contention and when there is it's undetectable.

The thing is, with a database you can only do some number of transactions per second (pick a number) and if you are storing each bar on 40,000 stocks at whatever interval [I do it on a one-second interval] then you are limited by the database.

Instead, store bars in memory...compress them using something like a "rep_count" which means if the last bar equals the next bar to store, then set "rep_count=2" for example and you can do a hell of a lot more that fits with what you're trying to accomplish, if performance is the goal or an objective!

Yes, for that application, I would be working with flat files as you are doing. For now, I found a very simple solution similar to what's described in this thread:

http://www.daniweb.com/forums/thread272775.html

I plan to stick with CSV files for now...eventually I will convert them all to binary, but that's lower on my list of things to do. I can do what WinstonTJ suggested in that other thread I mentioned earlier (if you want to see it, it's one of the 4-5 posts that I have made on here). Basically, I have one computer logging my data. I open up the data log directory to my local network as read-only. Then I simply read the log files on the client machine as they are being written. Two ways to do this:

1) On the client computer, open up the log file and read until readLine() == null. Then pause and call readLine() again in 500 ms, 1 sec, etc. (again, this is not an automated trading app...). The only trick is to either ensure only complete lines are written to the file...OR, if a line does not contain the newline character, back up one line (not sure how to do that in Java besides marking every line as I am reading the file).

2) Once all but the last few lines have been written, stop, establish a socket connection with the server computer and subscribe to the instrument of interest. Server computer then forwards all market information over the socket connection. Buffer that data and then advance the file reader until the line read from the file matches the first line of buffered data fed over the socket connection. Then close the file reader and only use data fed over the socket connection.
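
Both options can be sketched without a mark-per-line trick by tracking a byte offset and only consuming text up to the last newline. The code below is Python rather than Java, purely to keep it short; `LineTailer` and `splice` are hypothetical names. `splice` assumes every line is unique (for example, it carries a timestamp or sequence number), which is an assumption of this sketch and not something stated in the thread.

```python
import os
import tempfile


class LineTailer:
    """Follow a growing text file, returning only newline-terminated lines."""

    def __init__(self, path):
        self._path = path
        self._offset = 0  # byte offset just past the last complete line

    def poll(self):
        """Return the complete lines written since the previous call."""
        with open(self._path, "rb") as f:
            f.seek(self._offset)
            chunk = f.read()
        end = chunk.rfind(b"\n")
        if end == -1:
            return []  # only a partial line so far; try again later
        self._offset += end + 1  # a partial tail is re-read next time
        return chunk[:end].decode("utf-8").split("\n")


def splice(file_lines, socket_lines):
    """Option 2: join file history to the buffered live feed.

    Returns the combined list once the file has reached the first
    buffered socket line, or None if it has not caught up yet.
    """
    if not socket_lines:
        return None
    try:
        cut = file_lines.index(socket_lines[0])
    except ValueError:
        return None  # keep polling the file
    return file_lines[:cut] + socket_lines


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "log.csv")
    with open(path, "wb") as f:
        f.write(b"1,100.0\n2,100.5\n3,10")  # last line is incomplete
    tailer = LineTailer(path)
    print(tailer.poll())
    with open(path, "ab") as f:
        f.write(b"1.0\n")                   # writer finishes the line
    print(tailer.poll())
```

In a real client, `poll()` would be called on whatever interval suits you (500 ms, 1 sec), and `splice` would be retried after each poll until it returns a list.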