More stats:
Splits: 31996
Dividends: 167156
Symbol Changes: 102489
Trading Halts: 67408
Historic Symbols: 102489
Quote from PocketChange:
We are now able to locate and access any tick of any instrument nearly instantaneously (<10ms). The data is stored multiple times using different optimizations for accelerating performance.
Quote from hft_boy:
With all due respect -- what is the point of having a database and being able to locate random ticks quickly? First of all, 10ms is not fast -- that's 100 ticks a second. Second of all, most backtesting is done over sequential ticks. That means that you don't really need fast random access, but fast sequential access.
Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right type of tool for most trading type applications.
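To make the flat-file approach concrete, here is a minimal Python sketch of a sequential scan over fixed-width binary records. The 32-byte record layout and the names `RECORD`, `write_ticks`, `scan_ticks`, and `read_tick` are hypothetical, not anything described in the posts. Python will not get near the 30 ns per message quoted above, so treat this as a sketch of the access pattern, not a benchmark.

```python
import struct

# Hypothetical fixed-width record: timestamp in ns (int64), bid and ask (float64),
# bid size and ask size (uint32). Little-endian, no padding: 8+8+8+4+4 = 32 bytes.
RECORD = struct.Struct("<qddII")


def write_ticks(path, ticks):
    """Pack each tick tuple into a fixed-width record and write it in order."""
    with open(path, "wb") as f:
        for tick in ticks:
            f.write(RECORD.pack(*tick))


def scan_ticks(path, chunk_records=65536):
    """Yield tick tuples in file order, reading large sequential chunks."""
    chunk_bytes = RECORD.size * chunk_records
    with open(path, "rb") as f:
        while True:
            buf = f.read(chunk_bytes)
            if not buf:
                break
            # Ignore a trailing partial record, if the file was truncated.
            usable = len(buf) - (len(buf) % RECORD.size)
            yield from RECORD.iter_unpack(buf[:usable])


def read_tick(path, index):
    """Random access by record number: fixed width makes the offset trivial."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))
```

Because every record has the same width, the same file supports both the fast sequential pass and a direct seek to record `index`, without any index structure.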
Quote from PocketChange:
Bars: 25ms
25 ms? so, i'm guessing this was done using nanex as data source? they're the only ones i know who use that arbitrary timeslice. if you're not using them, i'm curious why you chose it.
Quote from PocketChange:
We build out our bars differently using ask/bid changes as the trigger and not last trade data.
so, what do you record then on the trigger? midpoint? if so, most illiquid things quote garbage a lot of times (around the open especially, and all the time if it's really illiquid), so mids/quotes are near useless. how do you get around this?
Quote from PocketChange:
Locate and load streams of consolidated records for any instrument starting at any point in time in 10ms.
Our data structures reduce the record count to an average of 3% of that of the FIX messages. Queries are optimized to return result sets of just the actionable events you're interested in, e.g. bid/ask changes, correlation triggers, etc.
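One possible reading of "bid/ask changes as the trigger" can be sketched as follows. This is not PocketChange's actual implementation, which the posts do not describe. The `Bar` fields, the function names, and the choice to skip empty bars are all assumptions, and the input is assumed to be `(timestamp_ns, bid, ask)` tuples sorted by non-negative timestamp.

```python
from dataclasses import dataclass

BAR_NS = 25_000_000  # 25 ms bar width in nanoseconds


@dataclass
class Bar:
    start_ns: int
    open_bid: float
    open_ask: float
    close_bid: float
    close_ask: float
    changes: int  # number of bid/ask changes that fell in this bar


def quote_changes(quotes):
    """Keep only quotes whose bid or ask differs from the previous quote."""
    last = None
    for ts, bid, ask in quotes:
        if (bid, ask) != last:
            last = (bid, ask)
            yield ts, bid, ask


def build_bars(quotes, bar_ns=BAR_NS):
    """Group bid/ask changes into fixed-width time bars; empty bars are skipped."""
    bars = []
    for ts, bid, ask in quote_changes(quotes):
        start = ts - ts % bar_ns  # align to the start of the bar
        if bars and bars[-1].start_ns == start:
            bar = bars[-1]
            bar.close_bid, bar.close_ask = bid, ask
            bar.changes += 1
        else:
            bars.append(Bar(start, bid, ask, bid, ask, 1))
    return bars
```

Dropping repeated quotes before bucketing is what shrinks the record count; how much it shrinks depends entirely on the feed.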
Quote from propseeker:
25 ms? so, i'm guessing this was done using nanex as data source? they're the only ones i know who use that arbitrary timeslice. if you're not using them, i'm curious why you chose it.
so, what do you record then on the trigger? midpoint? if so, most illiquid things quote garbage a lot of times (around the open especially, and all the time if it's really illiquid), so mids/quotes are near useless. how do you get around this?
anyway, kudos. i'm sure this took a bit of work.
Quote from hft_boy:
Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right type of tool for most trading type applications.
Quote from DeeDeeTwo:
Custom flat files are 10 to 100 times faster than databases...
And can be serialized in RAM for even greater speed.
You need a very high level of complexity in your data analysis...
To justify using SQL databases....
Which probably means you are overfitting.
Keep it simple... it's all about execution anyway.
+1 (still good after all these years!)
Quote from inflector:
I've got an eclectic background. Started programming in high school over 20 years ago writing futures trading systems. Had a bit of fame in my early twenties as a trader and then left for 15 years to start a few software companies.
One of them sold an embedded database which was the number one product on the Macintosh. I worked on the internals, disk access, etc. as well as the query optimization.
There is a huge difference in read time between a database and a binary file unless the database has been specifically optimized for large binary data storage (known as BLOBs in the business).
The reason is simple: even in a database with an efficient caching mechanism, large data sets generally involve multiple reads from the disk because the data is split up into chunks. Every separate read will take a while because, on average, it requires half a rotation of the disk before the data comes under the read heads and the read can start.
Unlike almost every other aspect of computing, disk speeds have not followed Moore's Law. Disks are maybe 30 to 100 times faster than they were 20 years ago while computers are 10,000 times faster.
Even a 10,000 RPM disk takes 6 milliseconds to rotate. So you only get 167 rotations per second. That's a lot of time when computers are doing billions of instructions per second.
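The rotation arithmetic above can be checked with a few lines of Python. The function names are made up for illustration, and the model only counts rotational wait (half a rotation per separate read, as described above); it ignores seek and transfer time.

```python
def rotation_stats(rpm):
    """Return (rotations per second, ms per rotation, average rotational wait in ms)."""
    rotations_per_sec = rpm / 60.0
    ms_per_rotation = 1000.0 / rotations_per_sec
    avg_wait_ms = ms_per_rotation / 2.0  # on average, wait half a rotation
    return rotations_per_sec, ms_per_rotation, avg_wait_ms


def scattered_read_wait_ms(rpm, n_reads):
    """Rotational wait alone for n separate reads; seek and transfer are ignored."""
    return n_reads * rotation_stats(rpm)[2]
```

For a 10,000 RPM disk this gives about 166.7 rotations per second, 6 ms per rotation, and a 3 ms average wait, so 1,000 scattered reads cost roughly 3 seconds of rotational wait before any data is transferred.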
For tick data analysis the speed of reading the data is the determining factor for the speed of testing unless you have very inefficient code or are doing esoteric analysis.
So I suggest storing information about your data in a database but storing the physical data on the disk in raw binary files.
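A minimal sketch of that split, assuming SQLite for the catalog and one flat file per symbol per day. The table name `catalog`, the three-field record layout, and the helper names are hypothetical; the post does not specify any of them.

```python
import os
import sqlite3
import struct

RECORD = struct.Struct("<qdd")  # hypothetical tick: timestamp ns, bid, ask


def open_catalog(db_path=":memory:"):
    """Open the metadata database; it stores where the data lives, not the data."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS catalog ("
        "symbol TEXT, day TEXT, path TEXT, n_records INTEGER, "
        "PRIMARY KEY (symbol, day))"
    )
    return conn


def store_day(conn, data_dir, symbol, day, ticks):
    """Write one symbol-day of ticks to a flat file and record it in the catalog."""
    path = os.path.join(data_dir, f"{symbol}_{day}.bin")
    with open(path, "wb") as f:
        for tick in ticks:
            f.write(RECORD.pack(*tick))
    conn.execute(
        "INSERT INTO catalog (symbol, day, path, n_records) VALUES (?, ?, ?, ?)",
        (symbol, day, path, len(ticks)),
    )


def load_day(conn, symbol, day):
    """Look up the file in the catalog, then read it in one sequential pass."""
    row = conn.execute(
        "SELECT path FROM catalog WHERE symbol = ? AND day = ?", (symbol, day)
    ).fetchone()
    if row is None:
        return []
    with open(row[0], "rb") as f:
        return list(RECORD.iter_unpack(f.read()))
```

The database answers "which file holds this symbol and day", which is a small query, and the bulk read stays a single contiguous file read.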
You can get acceptable performance from a database if you know what you are doing; however, you will always pay a performance penalty.
- Curtis