Quote from janus007:
Hi Wayne
I have been thinking about storing ticks as binary files, I haven't tested it though, but it could be interesting.
To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem
http://research.microsoft.com/apps/pubs/default.aspx?id=64525
Excellent reference on BLOBs!
Okay, here are some facts.
The article recommends that objects less than 256K be stored in the database.
When collecting full ticks with every change of the DOM, that's currently about 100 megabytes per week on a single currency pair.
(I'm actually planning to add a binary diff between ticks to reduce the file size, since consecutive ticks change very little.)
Still, it will be over 100 megabytes per week.
That 100 meg file takes about 2 seconds to load into memory without the engine running.
I don't think making a separate BLOB per day makes sense. Too granular.
Weekly blob is fine.
Therefore the article's recommendation is to use the file system for files of that size. I agree.
Folks, when sharing these files it will be a lot easier to just receive a file and plop it into the right folder so TickZOOM can find it than to have to load it into some database.
Of course, if it's not already in TickZOOM format, you will need to run a conversion.
Also, there's an API to select ticks if you need to for some purpose--the same API TickZOOM uses to load ticks. Just submit a symbol and date range.
So, especially after reading that article, I plan to go with a data folder which has a sub folder for every symbol used. Within each symbol folder there will be one folder per year.
Each yearly folder will contain just the weekly tick files.
The tick files will have a header which has a handle for every 10,000 ticks in the file. Each handle gives the timestamp and file offset.
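To make the folder layout concrete, here is a minimal sketch in Python of how a weekly file path could be built from a symbol and a date. The file name pattern, the ".tck" extension, the use of ISO week numbers, and the function name weekly_tick_path are all assumptions for illustration, not TickZOOM's actual naming.

```python
from datetime import date
from pathlib import Path


def weekly_tick_path(data_root, symbol, day):
    """Return the path of the weekly tick file that would hold `day`.

    Layout: <data_root>/<symbol>/<year>/<weekly file>
    The ISO year is used for the year folder so that a week spanning
    New Year always lands in exactly one folder (an assumption).
    """
    iso_year, iso_week, _ = day.isocalendar()
    file_name = f"week{iso_week:02d}.tck"  # hypothetical naming scheme
    return Path(data_root) / symbol / str(iso_year) / file_name


print(weekly_tick_path("data", "EURUSD", date(2009, 6, 15)))
```

With a scheme like this, sharing data really is just a matter of copying a file into the matching symbol and year folder.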
A weekly file will have approximately 1 million ticks.
At 10,000 ticks per "chunk" that makes around 100 handles.
So for every new file the engine creates, it will reserve a header with space for 2048 handles (for potential expansion).
It will endeavor to store the entire week in that one file and mark the offset to every chunk in the process.
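The header idea can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: each handle is packed as two little-endian 64-bit integers (timestamp, file offset), a 4-byte count of used handles precedes them, and a handle's timestamp is taken to be that of the first tick in its chunk. None of these details, nor the names below, come from the real TickZOOM format.

```python
import struct
from bisect import bisect_right

MAX_HANDLES = 2048               # reserved slots, as described above
CHUNK_TICKS = 10_000             # ticks covered by one handle
COUNT = struct.Struct("<I")      # number of handles in use (assumed field)
HANDLE = struct.Struct("<qq")    # (timestamp, file offset), assumed layout
HEADER_SIZE = COUNT.size + MAX_HANDLES * HANDLE.size


def pack_header(handles):
    """Serialize handles into a fixed-size header, zero-padding unused slots."""
    if len(handles) > MAX_HANDLES:
        raise ValueError("too many handles for the reserved header")
    body = b"".join(HANDLE.pack(ts, offset) for ts, offset in handles)
    return (COUNT.pack(len(handles)) + body).ljust(HEADER_SIZE, b"\0")


def unpack_header(raw):
    """Read back the list of (timestamp, offset) handles from a header."""
    (count,) = COUNT.unpack_from(raw, 0)
    return [HANDLE.unpack_from(raw, COUNT.size + i * HANDLE.size)
            for i in range(count)]


def chunk_offset(handles, timestamp):
    """Return the file offset of the chunk that may contain `timestamp`."""
    starts = [ts for ts, _ in handles]
    i = bisect_right(starts, timestamp) - 1
    return handles[max(i, 0)][1]  # clamp to the first chunk for early times
```

A reader asking for a date range would then seek straight to chunk_offset(handles, start_time) and read forward, rather than scanning the whole weekly file.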
NOTE: One feature I would like to see is for TickZOOM to list the available symbols and their date ranges.
So on startup, TickZOOM can scan all the headers and make an index.
It can automatically check file modification dates in the directories at each startup to see whether a new file has been dropped in.
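A rough sketch of that startup scan in Python might look like the following. It assumes the data/symbol/year/file layout described earlier and compares modification times against whatever was recorded on the previous run; the function name scan_tick_files and the shape of the index are hypothetical.

```python
from pathlib import Path


def scan_tick_files(data_root, known=None):
    """Scan <data_root>/<symbol>/<year>/* and report new or changed files.

    `known` maps file path -> modification time from the previous scan.
    Returns (index, changed): the fresh path -> mtime map, and the list
    of paths whose headers would need to be (re)read.
    """
    known = known or {}
    index, changed = {}, []
    for path in sorted(Path(data_root).glob("*/*/*")):
        if not path.is_file():
            continue
        mtime = path.stat().st_mtime_ns
        index[str(path)] = mtime
        if known.get(str(path)) != mtime:
            changed.append(str(path))  # new file, or modified since last scan
    return index, changed
```

Only the files listed in changed need their headers read again, so startup stays cheap even with years of weekly files on disk.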
Doesn't that solve it?
Databases only add value when objects or rows need to reference each other, as with joins or object references.
In the case of ticks, there's none of that. Just raw time series data.
Sincerely,
Wayne