Why use a database?

prophet · Oct 14, 2004

Quote from misctrader:

so, what is the best way to store, retrieve
high frequency minute or tick data for backtesting and doing analytics on it?
through a flat file with c++
or
through a db?
which way is faster?

Flat binary files with a fixed record length, and appropriate record length. I.e. Don't waste space and bandwidth with 8 byte doubles unless that precision is needed. Data representable in 1 or 2 byte records can really fly off the disk and into the CPU. Fixed length binary formats permit rapid seeking to blocks of data.

prophet · Oct 14, 2004

Quote from marist89:

Boy you guys like to write a lot of code. While you have to write 100's of lines of code, I have to SELECT * FROM tickData WHERE sample_dt > SYSDATE-10. Good luck to you.

Won't scale. Try doing that for tens of billions of data points.

prophet · Oct 14, 2004

Quote from choppystride:

1) Lack of main memory:
e.g. you want to evaluate a strategy on intraday data that spans several years and your memory can only store one week's worth of data at a time....Otherwise, like Gringinho said, I think the easiest and fastest way to go is to store your data in flat files and then stuff everything into main memory when you need to work with them.

Bad advice! What you describe is very inefficient memory-wise and does not scale. It is better to load parts of the data set into memory, do as much processing as possible, with cache efficient code, accumulate results in memory or disk, repeat....

A basic understanding of disk<->memory<->L* cache<->CPU bandwidths and latencies is necessary, as is the desire to experiment with code and use a profiler.

prophet · Oct 14, 2004

Quote from Gringinho:

You're so right!
I forgot to mention that I always think about tick-data and products like ES, EuroFX etc., but of course 1-min, 5-min or larger OHLC-data will not be so much of a challenge. Good point.

There's no free lunch. You are trading in computaional and algorithmic challenges for less opportunities design-wise.

prophet · Oct 14, 2004

Quote from Sparohok:

Tick data is hard to work with efficiently no matter what data structure you use.

Wrong in general. The problem must be approached differently. Ticks are no less efficient than an equal number of bars or any data points for that matter. If you are refering to the increased number of ticks versus bars over time, then yes there may be more computations. However, not all processing must process every tick. The first layer of processing can operate on every tick, and cache the results to memory or disk. The next layer can process results every 10 or 100 ticks, or every minute, etc. You can optimize the second layer while keeping the results of the first layer in cache. You could actually end up with less computional load by virtue of greater design opportunities with the higher frequency data.

IMHO, the only reason to stop using a transactional database as your backing store is if the total size of the database gets out of hand. So it depends on time resolution, number of symbols and how much history you want, and how much you want to spend on disks and backup media. So I guess we're back to "it depends on the application."

So computational efficiency doesn't matter?

It's not a matter of computing power, it's a matter of statistical validity. The institutionals do not calculate the covariance matrix of the S&P 500 because it would not be statistically valid.

That's wrong. They do calculate such matricies derrived from price changes and other technicals or statistics on many time frames. Not statisitically valid? That judgement depends purely on how the data is used. I have personally found success with 5000 stock * 1000 day covariance/correlation matrices of daily price changes and other daily statistics. Anyone who complains this isn't statistically significant enough simply hasn't found strong enough designs, signals or filters to achieve a desired profitability or stat sig.

There aren't enough degrees of freedom. Using more history gives you better statistical validity but worse predictive value since markets change over time. Roughly speaking, you would need 10,000 data points per stock for a valid covariance matrix of the S&P 500. That's almost 40 years of daily data.

Keep the huge data set. Allow some automatic walk-forward adaptation. Problem solved. Attempting to fit a fixed-rule based system over too much data can be an overwhelming challenge and a waste of time.

lastick · Oct 14, 2004

Quote from billgates:

From my journal:

"
------------------------------------------------------------------------
11 Nov 03
------------------------------------------------------------------------
Testing MySQL.

Populated with 101M intraday records. Time 5.3 hours, disk space 4.3G, index 2.4G.
Query to select 22000 records - 300 sec.
Query to select 1600 records - 3 sec.
Added 2 indexes ('open' and 'price') - 4 hours, 4.3G disk space.
"SELECT * FROM `rtquotes` where date='2003-11-06' and open>200 and open < 210" - 300 sec

Conclusion: not suitable (too slow)
"

Same query would cost me only about a second with flat files.

Absolutly,

2000 iterations with 3 variables on 81000 records : 10 secondes.(On a 2.7 Gz)

If one wants speed, one have to get a formula 1, not a truck. And speaking about real time financials apps, Sql stuffs on PCs are trucks, they are only suitable with mainframes and/or paralell computing.

prophet · Oct 14, 2004

Quote from lastick:

2000 iterations with 3 variables on 81000 records : 10 secondes.(On a 2.7 Gz)

2000 combinations * 81000 records / 10 seconds = 16.2M records*combinations per second. Not bad.

In Matlab and C on a 2.5GHz Pentium 4 it is possible to do:
6.7M ES ticks * 17 system variations * 1014 per-tick stop loss formula combinations / 16 minutes = 120M ticks*combinations per second.

I am currently doing operations like this on an dual Opteron 242 (1.6GHz):
10.2M NQ ticks * 750K system combinations / 10 hours, per processor = 208M ticks*combinations per second per CPU. Memory and VM usage is 500MB per process. Required disk bandwidth is very low over the 10 hour period. It reads a few MB of ticks every few minutes, then iterates through all the system combinations, then repeats. Most of the optimizations I use to achieve this are listed in this previous post: http://www.elitetrader.com/vb/showthread.php?s=&postid=594912#post594912

marist89 · Oct 14, 2004

Quote from billgates:

From my journal:

"
------------------------------------------------------------------------
11 Nov 03
------------------------------------------------------------------------
Testing MySQL.

Populated with 101M intraday records. Time 5.3 hours, disk space 4.3G, index 2.4G.
Query to select 22000 records - 300 sec.
Query to select 1600 records - 3 sec.
Added 2 indexes ('open' and 'price') - 4 hours, 4.3G disk space.
"SELECT * FROM `rtquotes` where date='2003-11-06' and open>200 and open < 210" - 300 sec

Conclusion: not suitable (too slow)
"

Same query would cost me only about a second with flat files.

That's funny, I can get 850K records out of my database of 500M records in less then a second. Of course, I run Oracle.

prophet · Oct 14, 2004

I'm not sure how database query speed can be accurately compared here. It depends on physical record size and internal organization. Does the query require a transpose or many drive head seeks? Is the data stored in a contiguous block?

One advantage of flat file formats is that you can choose the physical organization to optimize the most time consuming operation, be it reads or writes. Need to efficiently add to the data files as new market data comes in? Can design for that too.

Sparohok · Oct 14, 2004

Quote from prophet:
Ticks are no less efficient than an equal number of bars or any data points for that matter.

Tell me that after calculating the covariance of two sets of tick data.

The problem is not just the increasing amount of data, it is the lack of a consistant time scale. Algorithms are more complex and if you want to address the problem efficiently you need more complex data structures. The aggregation and caching techniques you describe just prove my point that tick data is harder to deal with than bar data.

Obviously there are both benefits and costs to using tick data. Whether that matters, well, that depends on the application.

In my application the only use I have for tick data is estimating trading costs.

Quote from prophet:
So computational efficiency doesn't matter?

Certainly not as much as good design, simplicity, and maintainability.

"Premature optimization is the root of all evil." Donald Knuth

Quote from prophet:
That's wrong. They do calculate such matricies derrived from price changes and other technicals or statistics on many time frames. Not statisitically valid? That judgement depends purely on how the data is used.

Of course it depends on how the data is used. But the poster asserted that institutionals don't do large covariance matricies for computational reasons. The real reason is significance, not computational complexity.

With the right methods you can get significant correlation data across large numbers of stocks, J.P.Morgan's RiskMetrics work is a classic example.

Martin