Thanks for all those valuable comments. I think I pretty much decided on the following:
Strictly splitting business logic between a) back testing / trading strategy related data management needs and b) data supply to run independent analytics, feed other unrelated apps,...
a) I decided to stick to my own custom binary time series data storage for backtesting purpose. All the back tests need is random access to a time series specified by start and end time stamp. I run each symbol of tick based data in a separate file and despite those being in the gigabyte region can access the time series at very fast speeds, faster than I have seen with any database I tried so far. Any analytical data that are output by the strategy (trades, TCA, risk, indicators, statistics) are stored in an in-memory database in order to query such data for research purposes post raw data iteration in the strategy modules. The query process is manual. In that there is a clear segregation between time series based data and more relational data on the other side.
b) I will spend more time on this later as I have not fully decided but currently I err on the side of a relational database. I do not need low latency access, only load couple 100 bars or 1000 ticks per symbol into some other applications or excel. I like to store tick and bar data of about 600-700 symbols in such database. What database would you recommend for such purpose? I want to access it through Excel, and .Net applications. As I need bars in Excel would you recommend a database that I can query with SQL and aggregate bars from ticks through an SQL query on the fly or store ticks and 1-minute bars in the database and aggregate bars from 1-minute bars given I do not need finer granularity bars? Would really appreciate some specific database recommendation for the specify example:
Data storage specs: 600-700 symbols, tick data, (fx, stocks, futures, indexes). Can store 1-minute bars as well. Storing about max 1 year history.
Read specs: No low latency requirements, want to read tick data between start and end time stamp of a given symbol; want to retrieve bars of any requested compression of a given symbol between start and end time stamp (would it be fast enough to compress tick data to bars on the fly through a query or will that be very very slow? ); want to query the average price of a given set of symbols at a specified time stamp.
Write specs: Need access to the previously closed 1-minute bar at the very least, so need to store either bars every minute (when the symbol in question is currently traded) or ticks in batches every minute.
Which database can match or exceed those specs? It does not sound extremely demanding to me, writing tick based data in batches every minute over 600-700 symbols could possibly be an issue for an RMDBS? Any ideas or suggestions? I emphasize access through Excel (thus SQL queries I guess) because its the lowest common denominator.
Thanks in advance for any recommendations.
Strictly splitting business logic between a) back testing / trading strategy related data management needs and b) data supply to run independent analytics, feed other unrelated apps,...
a) I decided to stick to my own custom binary time series data storage for backtesting purpose. All the back tests need is random access to a time series specified by start and end time stamp. I run each symbol of tick based data in a separate file and despite those being in the gigabyte region can access the time series at very fast speeds, faster than I have seen with any database I tried so far. Any analytical data that are output by the strategy (trades, TCA, risk, indicators, statistics) are stored in an in-memory database in order to query such data for research purposes post raw data iteration in the strategy modules. The query process is manual. In that there is a clear segregation between time series based data and more relational data on the other side.
b) I will spend more time on this later as I have not fully decided but currently I err on the side of a relational database. I do not need low latency access, only load couple 100 bars or 1000 ticks per symbol into some other applications or excel. I like to store tick and bar data of about 600-700 symbols in such database. What database would you recommend for such purpose? I want to access it through Excel, and .Net applications. As I need bars in Excel would you recommend a database that I can query with SQL and aggregate bars from ticks through an SQL query on the fly or store ticks and 1-minute bars in the database and aggregate bars from 1-minute bars given I do not need finer granularity bars? Would really appreciate some specific database recommendation for the specify example:
Data storage specs: 600-700 symbols, tick data, (fx, stocks, futures, indexes). Can store 1-minute bars as well. Storing about max 1 year history.
Read specs: No low latency requirements, want to read tick data between start and end time stamp of a given symbol; want to retrieve bars of any requested compression of a given symbol between start and end time stamp (would it be fast enough to compress tick data to bars on the fly through a query or will that be very very slow? ); want to query the average price of a given set of symbols at a specified time stamp.
Write specs: Need access to the previously closed 1-minute bar at the very least, so need to store either bars every minute (when the symbol in question is currently traded) or ticks in batches every minute.
Which database can match or exceed those specs? It does not sound extremely demanding to me, writing tick based data in batches every minute over 600-700 symbols could possibly be an issue for an RMDBS? Any ideas or suggestions? I emphasize access through Excel (thus SQL queries I guess) because its the lowest common denominator.
Thanks in advance for any recommendations.
Quote from Makis:
You got your requirements right, so for whatever is worth to you... I can tell you, you are on the right path, at least design-wise. Thats how the big boys do it, but they have the resources to put different teams/budgets on it. On one hand the real time capture team uses a column based database (kdb+, oneTick) to capture market data and a different team captures transactions on a relational database (oracle, Sybase)
The first database usually has 2 tables (trades,quotes) and queries never do joins, just extract part of the column and real-time calculate average spreads, realized volatility, intraday correlations, volume profiles, vwaps etc...
The transactions database has several tables and any meaningful query joins multiple tables.
The big difference is that the column based database is expected to capture thousands more messages per second than the relational database.
For people that try to build something at home, (usually after returning from work) this would be not be easy. I am not aware of a single product that can match the two database words, but in some kludgy way it can be done. I think that a column based database can accommodate your needs (need some crafty queries, that you will store as custom functions) but not the other way around.