Just got done architecting an expansive equities tick repository.
Some stats:
Symbols: 20,653
Period: 2008 - 2012
Bars 25ms: 27,741,118,213
Bars 1Sec: 14,007,345,833
Bars 1Min: 2,141,219,516
Messages: 742,640,774,253
Ask Changes: 24,825,915,500
Bid Changes: 24,722,845,608
Orders: 37,906,709,939
Volume: 10,485,098,567,764
Dedicated Servers: 5
Data Storage: 20TB
We chose a hybrid Hadoop-style implementation with SQL access.
Calling it I/O bound would be an understatement.
We are now able to locate and access any tick of any instrument nearly instantaneously (<10ms). The data is stored multiple times, with each copy optimized for a different kind of access.
Different structures are used for pairs analysis, graphing bars, index analysis, etc. We make extensive use of covering indexes (where the index itself contains the data needed to answer the query).
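For anyone who hasn't used covering indexes, here is a minimal sketch of the idea using Python's built-in sqlite3 module. It is not our actual schema or engine: the table `quotes`, its columns, the index name `ix_quotes_cover` and the sample rows are all made up for illustration. The point is only that when every column the query needs is in the index, the engine never has to touch the base table.

```python
import sqlite3

# Hypothetical query: every column it touches is part of the index below.
QUERY = ("SELECT ts_ms, bid, ask FROM quotes "
         "WHERE symbol = ? AND ts_ms BETWEEN ? AND ? ORDER BY ts_ms")


def build_demo_db():
    """Create an in-memory table of made-up quotes plus a covering index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE quotes "
        "(symbol TEXT, ts_ms INTEGER, bid REAL, ask REAL, venue TEXT)"
    )
    # The index holds symbol, ts_ms, bid and ask, so QUERY can be answered
    # from the index alone. 'venue' is deliberately left out.
    conn.execute(
        "CREATE INDEX ix_quotes_cover ON quotes (symbol, ts_ms, bid, ask)"
    )
    rows = [
        ("AAA", 0, 10.0, 10.5, "X"),
        ("AAA", 25, 10.25, 10.5, "Y"),
        ("AAA", 50, 10.25, 10.75, "X"),
        ("BBB", 10, 20.0, 20.5, "X"),
    ]
    conn.executemany("INSERT INTO quotes VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def query_plan(conn):
    """Return the human-readable steps SQLite plans to run for QUERY."""
    cur = conn.execute("EXPLAIN QUERY PLAN " + QUERY, ("AAA", 0, 100))
    return [row[3] for row in cur]


def quotes_in_range(conn, symbol, start_ms, end_ms):
    """Fetch (ts_ms, bid, ask) for one symbol within a time range."""
    return conn.execute(QUERY, (symbol, start_ms, end_ms)).fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(query_plan(db))
    print(quotes_in_range(db, "AAA", 0, 25))
```

If you add `venue` to the SELECT list, the plan should fall back to an ordinary index search plus a table lookup, which is exactly the extra I/O a covering index avoids.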
One of our driving forces for building this data repository was that the commercially available consolidated data is fundamentally flawed, being built around last-trade data. Exchange tape data is too slow to process for most of our algos.
We build our bars differently, using ask/bid changes as the trigger rather than last-trade data. Consequently, our backtested results nearly match our real-time executions. This is especially true when trading pairs and other cross-exchange correlated instruments.
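To make the quote-triggered idea concrete, here is a rough sketch in Python. It is not our production code. It assumes quotes arrive in time order as `(ts_ms, bid, ask)` tuples, and it uses the bid/ask midpoint as the bar price, which is purely an illustrative choice. The names `Bar` and `build_quote_bars` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    start_ms: int   # start of the bar interval, in milliseconds
    open: float
    high: float
    low: float
    close: float
    changes: int    # number of bid/ask changes seen inside the bar


def build_quote_bars(quotes, bar_ms):
    """Build OHLC bars from quote updates instead of last trades.

    quotes: iterable of (ts_ms, bid, ask), assumed sorted by ts_ms.
    A bar is only touched when the bid or the ask actually changes;
    repeated identical quotes are ignored.
    """
    bars = {}
    last_quote = None
    for ts_ms, bid, ask in quotes:
        if (bid, ask) == last_quote:
            continue  # no change in bid or ask, so nothing triggers
        last_quote = (bid, ask)
        mid = (bid + ask) / 2.0  # assumption: midpoint as the bar price
        start = ts_ms - ts_ms % bar_ms
        bar = bars.get(start)
        if bar is None:
            bars[start] = Bar(start, mid, mid, mid, mid, 1)
        else:
            bar.high = max(bar.high, mid)
            bar.low = min(bar.low, mid)
            bar.close = mid
            bar.changes += 1
    return [bars[key] for key in sorted(bars)]


if __name__ == "__main__":
    sample = [(0, 10.0, 10.5), (10, 10.0, 10.5),
              (20, 10.25, 10.5), (30, 10.0, 10.25)]
    for bar in build_quote_bars(sample, bar_ms=25):
        print(bar)
```

Note that an interval with no quote change produces no bar at all in this sketch. Whether to carry the previous close forward into empty intervals is a separate design decision.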
We're contemplating making access to these structures available as a service... renting out VMs with direct access to our 20TB repository... Send me a PM if you're interested.