Tick Database, Now Want to Run SQL

amazingIndustry · Jul 10, 2012

Do you understand the difference between reading persisted data IN THE EVENT OF A CRASH, and reading/consuming data in real-time for strategy computation purposes? I think I said enough.

Quote from propseeker:

Quote from amazingIndustry:

To be honest you do not understand, I remind you AGAIN that I never said I want to save quotes to disk and then read them again in real-time.

Quote from amazingIndustry:

I said I look to implement a solution that can move time series based data from memory to disk as close to real-time as possible in order to recover such saved data in the event of a crash.

the only one here with a reading comprehension problem is you. my comments stand. stop being such a douche.

[/B]

Makis · Jul 11, 2012

Quote from amazingIndustry:

...
I look for a solution that allows me to persist data within the data feed module to disk for recovery purposes. Obviously such routine would run within its own datablock (C# TPL Dataflow) or thread.

A lot depend on the specifics of your design but from the info so far your best bet would be to serialize each message and dump onto a file (or mmap if available, but would need some extra work for mmaps)

I have worked on several market data platforms and all provide native serialize() and deserialize() methods to convert messages to buffers. Chances are that whichever vendor api you use has similar functionality. Other feeds (e.g iqfeed) give you straight buffers. Dump those buffers onto a file on a dedicated thread. Alternatively, you can write your own serialize method and only dump messages after you aggregate them to your own format. There are cons/pros either way.

A database would be an overkill for your purpose, but you then have an option to use it in other ways as well, so it may worth your time. I have never used it, but recently came across MonetDB that looks promising for real time tick data capture.

A better solution (but time consuming to implement) would be to decouple your OMS and Market Data from the same process, and further decouple the market data aggregation and data capture, as you eliminate the possibility of your data capture dragging down the most critical components in case of high volume and data spikes. That also would depend on which asset classes and how many symbols you subscribe for.

amazingIndustry · Jul 11, 2012

Hi Makis, I really appreciate your response and candid and qualified feedback. I realize I too often criticize but I do not praise good things enough on this board. So, please note my gratitude.

a) I do not have issues with serialization/deserialization, I think I would be able to get quotes fast enough into pure binary file storage if I went the file route, rather than db.

b) You really hit a good point here and I should have clarified: I did originally intend to peruse the quote database as access point for other applications, spread sheets,... and that would have been a good argument to implement a data base. Since my last posts I thought about it and I think I need to clearly separate the business logic here. One part is my trading app that needs to access quotes in the event the connection goes down and it needs to re-start quickly and reload the last available time series to re-calculate indexes, indicators, ...

An entirely different application would be a data server/warehouse which would expose access to a database. At a later point I could simply have my trading app connect to the data server through tcp.

c) I agree with your last comments. I also came to the same conclusion. So far I ran a lightweight OMS, PMS, Risk module within each trading strategy container with its own execution gateway and data feed adapters. Several such containers can run in parallel on different markets, asset classes. I am now looking to pull out the OMS/PMS/Risk modules to create a global one that aggregates the whole book (not just each individual strategy container set), and I also look to separately pull out the market data feed. I already ran tests using the open source ZeroMQ messaging API, I am able to send messages even out of process over tcp at about 12 million messages a second which is way fast enough to run the datafeed with aggregation/consolidation modules as separate entity. This should provide several benefits. But I am diverging here away from the database issue.

I guess the database comes in handy as separate data server storage and query medium. I am still undecided when it comes to the exact type of database technology. On one hand I want to store and especially quickly retrieve time series based data (columnar db preferably for that), on the other hand I like to store transactions, TCA results, and query them in a very RDBMS/SQL type of way, meaning, running queries across values not just keys.

Your comments helped me to decide to go with a pure file dump approach within the trading app until I have separately come up with a fast database solution that can store real-time feed data frequently and make them available in the event of a crash. I prioritize the trading app for obvious reasons so the database may have to wait a little but please comment if you have ideas how to get the best of both worlds, pulling time series and running sql-like queries across values, not just keys.

Thanks

Quote from Makis:

A lot depend on the specifics of your design but from the info so far your best bet would be to serialize each message and dump onto a file (or mmap if available, but would need some extra work for mmaps)

I have worked on several market data platforms and all provide native serialize() and deserialize() methods to convert messages to buffers. Chances are that whichever vendor api you use has similar functionality. Other feeds (e.g iqfeed) give you straight buffers. Dump those buffers onto a file on a dedicated thread. Alternatively, you can write your own serialize method and only dump messages after you aggregate them to your own format. There are cons/pros either way.

A database would be an overkill for your purpose, but you then have an option to use it in other ways as well, so it may worth your time. I have never used it, but recently came across MonetDB that looks promising for real time tick data capture.

A better solution (but time consuming to implement) would be to decouple your OMS and Market Data from the same process, and further decouple the market data aggregation and data capture, as you eliminate the possibility of your data capture dragging down the most critical components in case of high volume and data spikes. That also would depend on which asset classes and how many symbols you subscribe for.

PocketChange · Jul 11, 2012

Consolidate the tick database to ask/bid changes and you can run very fast SQL queries.

ie. SPY reduces from 12 to 15 million daily messages into a tick accurate consolidated database of 25,000 records.

Further build and maintain 1 min and daily consolidations as helper tables and you can store all ticks on spy over 5 years in 2GB.

When structured and indexed properly you can query and return small record sets from any point in time in < 100ms. Large data sets load at wire speeds.

An example may be easier to understand.
Say your back testing a sequence of trades strategy and

On 2010-03-03 11:31.17.250 You open a long spy position at 114.05

You need to test for 114.50 for exit or 113.50 for next trade event.

This database structure queries the helper tables loaded as inmemory database to locate the specific minute bar of the events and then attaches and queries the appropriate daily tick table returning results in 100ms.

Based on your specific requirements both structure and indexes can be optimized for ultrafast sql queries.

Really should separate historic db from real time feed and order handling.

amazingIndustry · Jul 11, 2012

thanks for those valuable comments. I already segregate OMS from anything data feed related, even though they ran in the same box. What db would you recommend for such query approach?

Thanks.

Quote from PocketChange:

Consolidate the tick database to ask/bid changes and you can run very fast SQL queries.

ie. SPY reduces from 12 to 15 million daily messages into a tick accurate consolidated database of 25,000 records.

Further build and maintain 1 min and daily consolidations as helper tables and you can store all ticks on spy over 5 years in 2GB.

When structured and indexed properly you can query and return small record sets from any point in time in < 100ms. Large data sets load at wire speeds.

An example may be easier to understand.
Say your back testing a sequence of trades strategy and

On 2010-03-03 11:31.17.250 You open a long spy position at 114.05

You need to test for 114.50 for exit or 113.50 for next trade event.

This database structure queries the helper tables loaded as inmemory database to locate the specific minute bar of the events and then attaches and queries the appropriate daily tick table returning results in 100ms.

Based on your specific requirements both structure and indexes can be optimized for ultrafast sql queries.

Really should separate historic db from real time feed and order handling.

Makis · Jul 12, 2012

Quote from amazingIndustry:

...
I guess the database comes in handy as separate data server storage and query medium. I am still undecided when it comes to the exact type of database technology. On one hand I want to store and especially quickly retrieve time series based data (columnar db preferably for that), on the other hand I like to store transactions, TCA results, and query them in a very RDBMS/SQL type of way, meaning, running queries across values not just keys.
...
I prioritize the trading app for obvious reasons so the database may have to wait a little but please comment if you have ideas how to get the best of both worlds, pulling time series and running sql-like queries across values, not just keys.

Thanks

You got your requirements right, so for whatever is worth to you... I can tell you, you are on the right path, at least design-wise. Thats how the big boys do it, but they have the resources to put different teams/budgets on it. On one hand the real time capture team uses a column based database (kdb+, oneTick) to capture market data and a different team captures transactions on a relational database (oracle, Sybase)

The first database usually has 2 tables (trades,quotes) and queries never do joins, just extract part of the column and real-time calculate average spreads, realized volatility, intraday correlations, volume profiles, vwaps etc...
The transactions database has several tables and any meaningful query joins multiple tables.
The big difference is that the column based database is expected to capture thousands more messages per second than the relational database.

For people that try to build something at home, (usually after returning from work) this would be not be easy. I am not aware of a single product that can match the two database words, but in some kludgy way it can be done. I think that a column based database can accommodate your needs (need some crafty queries, that you will store as custom functions) but not the other way around.

Makis · Jul 12, 2012

Quote from PocketChange:

Consolidate the tick database to ask/bid changes and you can run very fast SQL queries.

ie. SPY reduces from 12 to 15 million daily messages into a tick accurate consolidated database of 25,000 records.

Further build and maintain 1 min and daily consolidations as helper tables and you can store all ticks on spy over 5 years in 2GB.

When structured and indexed properly you can query and return small record sets from any point in time in < 100ms. Large data sets load at wire speeds.

An example may be easier to understand.
Say your back testing a sequence of trades strategy and

On 2010-03-03 11:31.17.250 You open a long spy position at 114.05

You need to test for 114.50 for exit or 113.50 for next trade event.

This database structure queries the helper tables loaded as inmemory database to locate the specific minute bar of the events and then attaches and queries the appropriate daily tick table returning results in 100ms.

Based on your specific requirements both structure and indexes can be optimized for ultrafast sql queries.

Really should separate historic db from real time feed and order handling.

Granted that I have never used a database for backtesting ( and in my humble opinion querying a database is not the proper tool to backtest ) I think that approach severely limits the database to a very specific usage and almost entirely eliminates the possibility of using it for other productive functions.

The problem with 'deltas' and databases is that, even if you can reconstruct the time series, it would not be feasible to do that at real time (that is what i mean other functions)

On the example above, what if the strategy requires the 30 day average realized volatility at 11:31, or the average 15 day average volume between 11:26 & 11:31, or vwap, or maximum runup, etc... ? Reconstructing the historical time series at real-time to calculate those would be impossible timewise.

This schema would probably work great for backtesting, but i do no see why if someone puts the effort in building a historical database, eliminate its usage for real time statistics and analytics queries, or even for analysis on packages such as R.

PocketChange · Jul 12, 2012

As long as you have captured and stored the underlying data or can create it by calculation your not missing much. I would argue a consolidated quote database more accurately reflects the market then standard T&S based data.

A generalized one size fits all approach may work but will not be optimized.

For your example you run queries to build the data structure you want. Set triggers and schedules to maintain it. The key for speed is limiting the data set, optimizing indexes and loading focus result sets as inmemory databases.

The inmemory processes in CPU cycles.

Terabyte repositories are I/O bound, both disk and LAN. If you want speed you Preprocess the historic data and optimize the data structures and indexes for your specific purpose / process.

There are infinite possibilities so rather than try the one size fits all approach or sequentially running tapes use tools to dynamically create, manage and maintain a focused and optimized data set.

The Data is platform agnostic... If you have a columnar tool that can process 5 million records a second...Think about how much faster it can run optimized data structures with 1/100,000 the records.

Quote from Makis:

Granted that I have never used a database for backtesting ( and in my humble opinion querying a database is not the proper tool to backtest ) I think that approach severely limits the database to a very specific usage and almost entirely eliminates the possibility of using it for other productive functions.

The problem with 'deltas' and databases is that, even if you can reconstruct the time series, it would not be feasible to do that at real time (that is what i mean other functions)

On the example above, what if the strategy requires the 30 day average realized volatility at 11:31, or the average 15 day average volume between 11:26 & 11:31, or vwap, or maximum runup, etc... ? Reconstructing the historical time series at real-time to calculate those would be impossible timewise.

This schema would probably work great for backtesting, but i do no see why if someone puts the effort in building a historical database, eliminate its usage for real time statistics and analytics queries, or even for analysis on packages such as R.

amazingIndustry · Jul 12, 2012

I am a little confused with what you mean: I generally load tick data either from file or database and stream ticks sorted by time stamp into my testing engine, (which is the exact same strategy container as the one in which my live strategies run, different box of course but 100% identical framework in order to avoid having to port the code later on) and calculate whatever the strategy needs such as volatility measures on the fly in the same way a live strategy would. In that sense I think it has nothing to do with where you get the data from, database or flat files. I currently derive them from my own custom binary flat file storage which I am pretty sure no kdb query (I used kdb at my previous bank), hdf5 load, or other storage/query solution can achieve. I tested hdf5 and Berkeley DB and they fall massively behind the constant 5+ million ticks / second I can load, deserialize, merge into true time-stamp-sorted stream for my strategy container to consume. By the way the performance numbers are for unbatched messages/ticks but I only reach it in my own custom concurrent framework (use the brand new TPL Dataflow library in .Net)

I am sure I will stick to flat files for back test purposes, so the issue I see is merely related to how fast time series can be loaded into memory. Processing the time series (such as decoding the deltas, adjusting the series for splits, or other price adjusted measures) is an entirely different issue, imho, from loading the series into memory.

I think you were more thinking along the lines of querying the database directly for analytical purposes, correct? I would never do such in real-time for back testing or live trading purposes it can never reach the speed necessary. I rather see an advantage in being able to query the database in the post-strategy-run analysis stage. For example, you may have saved your executions and associated time stamps and would like to know how prices were distributed around your entry price post-trade. You could query the database for the prices subsequent to your trade entry and build a price/return distribution profile.

Quote from Makis:

Granted that I have never used a database for backtesting ( and in my humble opinion querying a database is not the proper tool to backtest ) I think that approach severely limits the database to a very specific usage and almost entirely eliminates the possibility of using it for other productive functions.

The problem with 'deltas' and databases is that, even if you can reconstruct the time series, it would not be feasible to do that at real time (that is what i mean other functions)

On the example above, what if the strategy requires the 30 day average realized volatility at 11:31, or the average 15 day average volume between 11:26 & 11:31, or vwap, or maximum runup, etc... ? Reconstructing the historical time series at real-time to calculate those would be impossible timewise.

This schema would probably work great for backtesting, but i do no see why if someone puts the effort in building a historical database, eliminate its usage for real time statistics and analytics queries, or even for analysis on packages such as R.

amazingIndustry · Jul 12, 2012

I focus on fx markets and generally I do not see the optimization opportunities you indicated. On busy days I capture about 200,000 ticks (quotes) per symbol, there is nothing to optimize about it. Fx data are bid and offer, only. I experimented with only storing bid and offer deltas , the offer being a delta from the bid itself. I found that the bottleneck at a certain level is not how fast you get quotes into memory (including the overhead of deserialization, building back the quote from deltas, merging multi-symbol streams into one consolidated time stamp sorted stream) but the real bottleneck becomes the strategy algorithms that consume the quotes, trades,.... I ended up storing binary representations of full bid/offer quotes which ended up to be 16bytes, including time stamp (down to microseconds). I currently max out at 5 million ticks/quotes per second, and that not at the data feed module but my strategy module, with my current quad core hardware on one single machine the strategies cannot chew through more than 5 million ticks a second no matter how many more ticks per second my data feed can provide on the producer side. The simple reason for that is that all indicators, all analytics necessary for the strategy engine to make trading decisions are calculated on the fly. It would defeat the whole purpose of such platform if I started to pre-calculate analytics, store them in files or databases and loaded those alongside the quote data in a backtest. That feels to me like a Python or R, Matlab approach to profiling and testing ideas. I want to run a testing platform that is nearly identical to the live trading box, no porting of code necessary.

Quote from PocketChange:

As long as you have captured and stored the underlying data or can create it by calculation your not missing much. I would argue a consolidated quote database more accurately reflects the market then standard T&S based data.

A generalized one size fits all approach may work but will not be optimized.

For your example you run queries to build the data structure you want. Set triggers and schedules to maintain it. The key for speed is limiting the data set, optimizing indexes and loading focus result sets as inmemory databases.

The inmemory processes in CPU cycles.

Terabyte repositories are I/O bound, both disk and LAN. If you want speed you Preprocess the historic data and optimize the data structures and indexes for your specific purpose / process.

There are infinite possibilities so rather than try the one size fits all approach or sequentially running tapes use tools to dynamically create, manage and maintain a focused and optimized data set.

The Data is platform agnostic... If you have a columnar tool that can process 5 million records a second...Think about how much faster it can run optimized data structures with 1/100,000 the records.