I agree with everything including that I'm not inventing anything new by using binary.
But there's one thing I hope you will elaborate on and one correction to make. See below:
Quote from byteme:
Storing tick data in binary format in files (one file per symbol) is the way to go for tick stream playback/back-testing etc.
However, there are many use cases where offline analysis of the tick data is necessary - the binary files are not normally amenable to this kind of analysis.
Q: What kind of analysis? I confess I'm clueless about any use of tick data other than historical testing, and that is most excellently done in binary format.
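For anyone following along, here is a minimal sketch of what "binary format, one file per symbol" can look like. The record layout (timestamp in microseconds, bid, ask, size) and the names `TICK`, `append_ticks` and `read_ticks` are assumptions made up for illustration; this is not TickZOOM's actual file format.

```python
import struct

# Hypothetical fixed-width tick record (an assumption, not a real format):
# int64 timestamp in microseconds, float64 bid, float64 ask, int32 size.
# "<" means little-endian with no padding, so each record is 28 bytes.
TICK = struct.Struct("<qddi")


def append_ticks(path, ticks):
    """Append (timestamp_us, bid, ask, size) tuples to one symbol's file."""
    with open(path, "ab") as f:
        for tick in ticks:
            f.write(TICK.pack(*tick))


def read_ticks(path):
    """Read the whole file in one call and decode it record by record."""
    with open(path, "rb") as f:
        data = f.read()
    return list(TICK.iter_unpack(data))
```

Because every record has the same width, the n-th tick sits at byte offset `n * TICK.size`, which is what makes sequential playback cheap.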
The solution, or rather, ONE solution, is to have the necessary tools to easily transfer data from binary format to RDBMS etc. when needed.
Q: You "sound" as though this is something you have actually done. And yet, being an RDBMS expert myself and having tried to load tick data into several of them, I found it unworkable to store one tick per row. Most databases have trouble after a few million rows of data. In this case, for one symbol, we're talking about 100 million rows per year (when storing every single change of DOM).
I worked at one company with THE largest Oracle installation in the country and they had to move to Teradata to handle that kind of volume. Forget about Berkeley, MySQL, etc.
So you "sound" like you've done this but I can't see how that's possible. So please elaborate.
The data conversion/transfer time is not normally critical for offline analysis.
Oh really? Not critical? Did you catch the number of ticks? Let's say it only takes you 10 milliseconds to process each tick.
Let's do the math. On 10 million ticks, that will take 27+ hours to process.
I beg to differ; it seems just the opposite. Real-time processing is far easier. That's why so many platforms do that but fail at historical testing of ticks.
It's MUCH more important to have the fastest possible speed during offline testing. TickZOOM handles them at 1 microsecond per tick, which means that in real time it can handle 500,000 to 1 million ticks per second. No exchange will generate that many ticks per second even if you're tracking dozens of symbols.
So it's the offline analysis where speed is most critical.
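The arithmetic is easy to check. A minimal sketch (the function names are mine, chosen only for illustration) that turns a per-tick cost into a total run time and into a throughput figure:

```python
def total_hours(ticks, seconds_per_tick):
    """Hours needed to process `ticks` ticks at a fixed per-tick cost."""
    return ticks * seconds_per_tick / 3600.0


def ticks_per_second(seconds_per_tick):
    """Throughput implied by a fixed per-tick cost."""
    return 1.0 / seconds_per_tick


# 10 million ticks at 10 milliseconds each versus 1 microsecond each.
print(total_hours(10_000_000, 10e-3))
print(total_hours(10_000_000, 1e-6) * 3600.0, "seconds")
print(ticks_per_second(1e-6))
```

At 10 milliseconds per tick, 10 million ticks come to 100,000 seconds, a bit under 28 hours. At 1 microsecond per tick, the same 10 million ticks take about 10 seconds, which corresponds to the upper end of the range I quoted, 1 million ticks per second.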
Naturally, the API for data persistence can be pluggable so that other storage mechanisms can be deployed as the user deems fit.
Certainly. Agreed.
First, you need to define what is needed in this API.
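As a starting point for that discussion, here is a rough sketch of what a pluggable persistence interface could look like. Everything in it (`Tick`, `TickStore`, `MemoryTickStore`, `replay`) is hypothetical and is not TickZOOM's API; it only shows that the engine can depend on two operations, append and sequential read, while the storage behind them is swapped out.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Protocol


class Tick(NamedTuple):
    """Hypothetical tick record; the fields are assumptions."""
    timestamp_us: int
    bid: float
    ask: float
    size: int


class TickStore(Protocol):
    """The minimal contract a storage plug-in would have to satisfy."""

    def append(self, symbol: str, ticks: Iterable[Tick]) -> None: ...

    def read(self, symbol: str) -> Iterator[Tick]: ...


class MemoryTickStore:
    """Toy in-memory implementation; a binary-file or RDBMS store
    would expose the same two methods."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Tick]] = {}

    def append(self, symbol: str, ticks: Iterable[Tick]) -> None:
        self._data.setdefault(symbol, []).extend(ticks)

    def read(self, symbol: str) -> Iterator[Tick]:
        return iter(self._data.get(symbol, []))


def replay(store: TickStore, symbol: str) -> int:
    """Stand-in for the engine: consumes ticks in order, returns the count."""
    return sum(1 for _ in store.read(symbol))
```

The point of keeping the contract this small is that playback only ever needs ticks in time order, so any backend that can stream them sequentially qualifies.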
As far as a proprietary binary file format, going with a symbol per file is not a bad idea. Symbols can be put onto different disks if needed as disk I/O becomes a bottleneck.
That won't be an issue. Again, people keep focusing on the wrong problem.
Remember, it takes only 3 seconds for TickZOOM to load an entire 100 MB file. But it takes the CPU 40 seconds to process that file through the engine.
So what happens when you add another symbol? The CPU will now take 80 seconds, and loading only 10 seconds.
In other words, until we get massively parallel processing (and even then), disk still won't be the bottleneck. The disk will continue to be faster than the processing. That's due to the limitations of parallel processing on time series data (it has dependencies).
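To put a number on "disk won't be the bottleneck", here is a small back-of-the-envelope model using the 3 second load and 40 second processing figures from above. The function names are made up for this sketch, and it assumes loading either runs strictly before processing or overlaps with it perfectly.

```python
def run_time(load_s, cpu_s, overlapped=False):
    """Wall-clock estimate: the sum if loading and processing run one
    after the other, the larger of the two if they overlap perfectly."""
    return max(load_s, cpu_s) if overlapped else load_s + cpu_s


def speedup_before_disk_bound(load_s, cpu_s):
    """How many times faster processing must get before the disk
    becomes the slower of the two stages."""
    return cpu_s / load_s


print(run_time(3.0, 40.0))                    # load, then process
print(run_time(3.0, 40.0, overlapped=True))   # load while processing
print(speedup_before_disk_bound(3.0, 40.0))
```

With these figures, processing would have to get roughly 13 times faster before the disk became the limiting stage.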
The alternative is to have a virtual file-system within a file on disk but this cannot be de-fragmented as easily and doesn't have the above mentioned benefits.
You may also want to consider an appropriate level of compression - not for saving disk space, as that is cheap, but for reducing the time taken to read the data from disk.
Okay, but we're still working on the wrong problem. The bottleneck here is the CPU. Decompressing, therefore, will make it even slower, not faster, because it takes more CPU to perform the decompression while loading.
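Here is a rough model of that trade-off. The 3 second load and 40 second processing figures are the ones from earlier in this post; the compression ratio of 3 and the 4 seconds of decompression time are made-up numbers purely for illustration, and the model assumes loading, decompression and processing run one after another on the same CPU.

```python
def total_seconds(load_s, cpu_s, ratio=1.0, decompress_s=0.0):
    """Sequential estimate: a compressed file is `ratio` times smaller,
    so it loads `ratio` times faster, but decompression adds CPU time."""
    return load_s / ratio + decompress_s + cpu_s


def break_even_decompress_s(load_s, ratio):
    """Decompression time at which compression neither helps nor hurts."""
    return load_s * (1.0 - 1.0 / ratio)


print(total_seconds(3.0, 40.0))                               # uncompressed
print(total_seconds(3.0, 40.0, ratio=3.0, decompress_s=4.0))  # hypothetical
print(break_even_decompress_s(3.0, 3.0))
```

Under those assumptions, compression can save at most the 3 seconds of load time, so it only pays off if decompression costs less than the load time it removes.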
Proprietary binary file formats for storing tick data have been done a bunch of times and you aren't inventing anything new here but it's nice to see you being so enthusiastic about it.
Thanks for your comments. And I'm sure you're much smarter and more skilled with databases than I am. I mean that sincerely. However, this is not a database problem. It's a CPU problem.
And, obviously, I do okay in that area (but I can do better).