Why use a database?

billgates · Oct 14, 2004

Quote from prophet:

2000 combinations * 81000 records / 10 seconds = 16.2M records*combinations per second. Not bad.

In Matlab and C on a 2.5GHz Pentium 4 it is possible to do:
6.7M ES ticks * 17 system variations * 1014 per-tick stop loss formula combinations / 16 minutes = 120M ticks*combinations per second.

I am currently doing operations like this on a dual Opteron 242 (1.6GHz):
10.2M NQ ticks * 750K system combinations / 10 hours, per processor = 208M ticks*combinations per second per CPU. Memory and VM usage is 500MB per process. Required disk bandwidth is very low over the 10 hour period. It reads a few MB of ticks every few minutes, then iterates through all the system combinations, then repeats. Most of the optimizations I use to achieve this are listed in this previous post: http://www.elitetrader.com/vb/showthread.php?s=&postid=594912#post594912

I'm doing 250M ticks * combinations with a $60 Celeron.
Are you using c++ or python?

prophet · Oct 14, 2004

Quote from Sparohok:

Tell me that after calculating the covariance of two sets of tick data.

Tick-delimited time scales are not usually appropriate for calculating covariances. It is better to index the ticks by some time interval, accumulating or averaging the tick-delimited information into time-delimited bins. Then do the covariance. You're also not tied to fixed-length time bins. Use whatever temporal stepping you choose. Skip bars. Use volume-weighted bar length. I like a tick-time hybrid stepping. The point is that ticks allow you to look at any stepping, and you can natively analyze per-tick fundamentals which would get lost with typical time-delimited data.

The problem is not just the increasing amount of data, it is the lack of a consistent time scale.

It's an easy problem to solve. Take your tick data, calculate the time indices of your choice, and index each tick into a second array, accumulating or averaging if you allow multiple ticks per index. Now process the second array instead. This is a very fast and simple process.
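A minimal sketch of the two-array idea, assuming ticks arrive as (timestamp in seconds, price) pairs and that averaging is the chosen way to combine multiple ticks per index. The function names (`bin_ticks`, `binned_covariance`) are made up for illustration, and only bins present in both series are compared:

```python
from collections import defaultdict


def bin_ticks(ticks, bin_seconds):
    """Average tick prices into fixed-width time bins.

    ticks: iterable of (timestamp_seconds, price) pairs (assumed layout).
    Returns a dict mapping bin index -> mean price of the ticks in that bin.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, price in ticks:
        idx = int(ts // bin_seconds)  # time index for this tick
        sums[idx] += price
        counts[idx] += 1
    return {idx: sums[idx] / counts[idx] for idx in sums}


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (needs >= 2 points)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def binned_covariance(ticks_a, ticks_b, bin_seconds):
    """Bin both tick series on the same time grid, then take the covariance."""
    a = bin_ticks(ticks_a, bin_seconds)
    b = bin_ticks(ticks_b, bin_seconds)
    common = sorted(a.keys() & b.keys())  # bins where both series have data
    return covariance([a[i] for i in common], [b[i] for i in common])
```

Swapping the `ts // bin_seconds` line for another index calculation (cumulative volume, a tick-time hybrid) changes the stepping without touching the rest.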

The aggregation and caching techniques you describe just prove my point that tick data is harder to deal with than bar data.

Aggregation and caching are completely independent of tick data. Use the techniques for bar data if you want. These are merely efficiency enhancements, not a crutch for processing tick data because tick data is somehow so difficult to analyze.

Obviously there are both benefits and costs to using tick data. Whether that matters, well, that depends on the application. In my application the only use I have for tick data is estimating trading costs.

Yes, it is very important to accurately simulate executions. Too many system traders constrain themselves to bar data out of fear of complexity, when they really could use inter-bar execution simulation, not to mention tick-based analysis which has unlimited possibilities.

prophet · Oct 14, 2004

Quote from billgates:
I'm doing 250M ticks * combinations with a $60 Celeron.
Are you using c++ or python?

That's excellent! Care to share any other details? I use Matlab and C (MEX functions).

kc11415 · Oct 14, 2004

marist89>...I can get 850K records out of my database of 500M records in less than a second. Of course, I run Oracle.

That sounds rather good.

May I ask:

1) Is your timing of this query fresh after the database is started? Or is it after some of these 850K rows have had time to be accessed by other queries, such that many are still cached in the disk buffer? You're not running this query multiple times and then reporting the later/faster result?

2) You're not by chance pinning anything in memory in the KEEP POOL?

3) What's the datatype of the column(s) in the WHERE clause?

4) Are you using a hash or bitmap index for the column(s) referenced in the WHERE clause?

5) Do you have a composite index exactly matching the conditions referenced in your WHERE clause, or do you have individual indexes on each separate column referenced in the WHERE clause?

6) Are you using table partitioning? If so, do(es) the condition(s) in your WHERE clause relate to the column(s) used to specify partition boundaries?

7) Is it safe to assume that you are not using an ORDER BY clause? If so, how do you ensure rows come out in the same order on every query? Just because they are inserted in a particular order is no guarantee they will come out in the same order.

8) Do you avoid UPDATES, or else INSERTS after DELETES on this table, which can lead to automatic free space management in the form of coalescing or row migration? (which can change the order of rows)
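Questions 5 and 7 can be illustrated with a small sketch. It uses SQLite from Python's standard library rather than Oracle, so it says nothing about Oracle's optimizer or marist89's setup; the table `ticks`, its columns, and the index name are all made up for the example. It shows a composite index that matches the WHERE clause and an explicit ORDER BY that fixes the row order regardless of insertion order:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ticks (symbol TEXT, ts INTEGER, price REAL)")

# Rows are deliberately inserted out of time order.
conn.executemany(
    "INSERT INTO ticks VALUES (?, ?, ?)",
    [("ES", 3, 1110.25), ("NQ", 1, 1450.5), ("ES", 1, 1110.0), ("ES", 2, 1110.5)],
)

# Composite index matching the WHERE clause: equality column first, range second.
conn.execute("CREATE INDEX idx_ticks_symbol_ts ON ticks (symbol, ts)")

query = (
    "SELECT ts, price FROM ticks "
    "WHERE symbol = ? AND ts BETWEEN ? AND ? "
    "ORDER BY ts"  # without this, row order is not guaranteed
)
params = ("ES", 1, 3)

rows = conn.execute(query, params).fetchall()

# The last column of each EXPLAIN QUERY PLAN row describes the access path.
plan = " ".join(str(r[3]) for r in conn.execute("EXPLAIN QUERY PLAN " + query, params))
```

Here `rows` comes back sorted by `ts`, and `plan` names the index the query used.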

prophet · Oct 14, 2004

Quote from Sparohok:

Certainly not as much as good design, simplicity, and maintainability

Do you even realize what you are saying? Computational efficiency does matter when you are talking about 1 to 4 orders of magnitude improvements. Who wouldn't like to have the equivalent of 10 to 10,000 times as much computing power?

In the end what matters most is how fast systems are designed and deployed to make money at acceptable risk, before the market evolves and makes the systems obsolete. We use computational tools to find profitable system designs because it is faster than doing the math by hand. Why limit yourself out of convenience? The faster you search, the faster you find. I would never rank simplicity higher than testing performance. Good design is very important, but needs to be defined. Maintainability is questionable. Often the most efficient designs are only good for one purpose, or have ugly, hard-to-maintain code. So there are always trade-offs. All things being equal, if you rank efficiency too low you may severely reduce your chances of success in a reasonable period of time.

choppystride · Oct 14, 2004

Quote from choppystride:

1) Lack of main memory:
e.g. you want to evaluate a strategy on intraday data that spans several years and your memory can only store one week's worth of data at a time....

Otherwise, like Gringinho said, I think the easiest and fastest way to go is to store your data in flat files and then stuff everything into main memory when you need to work with them.

Quote from prophet:

Bad advice! What you describe is very inefficient memory-wise and does not scale. It is better to load parts of the data set into memory, do as much processing as possible, with cache efficient code, accumulate results in memory or disk, repeat....

A basic understanding of disk<->memory<->L* cache<->CPU bandwidths and latencies is necessary, as is the desire to experiment with code and use a profiler.

Huh? Are you referring to the "lack of main memory" part or the "otherwise" part?

prophet · Oct 14, 2004

Quote from choppystride:

Huh? Are you referring to the "lack of main memory" part or the "otherwise" part?

Sorry for the confusion. I realized my mistake but missed the 60 minute post-editing window.

I was only criticizing the lack of main memory issue and the advice to stuff all data into memory. I agree with the part about using flat files. Sorry!
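A rough sketch of the load-a-part, process, accumulate, repeat loop over a flat file. The three-column CSV layout (timestamp, price, volume) and the names `iter_chunks` and `total_volume_and_high` are assumptions for illustration; the running total and high stand in for whatever per-chunk results a real test would accumulate:

```python
import csv
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield successive lists of up to chunk_size rows from an iterable."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def total_volume_and_high(fileobj, chunk_size=100_000):
    """Scan a tick file chunk by chunk, keeping only small running results.

    Assumed row layout: timestamp, price, volume.
    """
    total_volume = 0
    high = None
    for chunk in iter_chunks(csv.reader(fileobj), chunk_size):
        # Only this chunk is held in memory; do all the work on it now.
        prices = [float(row[1]) for row in chunk]
        total_volume += sum(int(row[2]) for row in chunk)
        chunk_high = max(prices)
        high = chunk_high if high is None else max(high, chunk_high)
    return total_volume, high
```

Called as `total_volume_and_high(open("ticks.csv", newline=""))`, where the file name is a placeholder, memory use is bounded by `chunk_size` rather than by the size of the file.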

billgates · Oct 14, 2004

Quote from kc11415:

marist89>...I can get 850K records out of my database of 500M records in less than a second. Of course, I run Oracle.

That sounds rather good.

Duh good it is... What's your hardware, OS, Oracle version?

harrytrader · Oct 14, 2004

The latest financial DB tech allows billions of records to be analysed in seconds.

billgates · Oct 14, 2004

Quote from prophet:

That's excellent! Care to share any other details? I use Matlab and C (MEX functions).

I preselect 20-30K ticks, which fit nicely into a 256K L2 cache.
All integer arithmetic (no floats), hand-optimized.
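To make the cache arithmetic concrete: 30,000 ticks stored as 4-byte integers take 120,000 bytes, under half of a 256K (262,144-byte) cache. The sketch below shows only the loop order, one block of ticks reused across every combination before moving on. It is not billgates' code: the 4-byte width, the price-to-integer conversion, and the toy "count ticks at or above a threshold" test are all assumptions, and pure Python will not show the speedup that hand-optimized C would.

```python
from array import array

L2_CACHE_BYTES = 256 * 1024  # the 256K L2 cache size quoted above
BLOCK_TICKS = 30_000         # upper end of the 20-30K range quoted above


def to_integer_ticks(prices, tick_size):
    """Convert float prices to integer multiples of tick_size (no floats after this)."""
    return array("i", (round(p / tick_size) for p in prices))


def block_fits_in_cache(n_ticks, bytes_per_tick=4, cache_bytes=L2_CACHE_BYTES):
    """Check whether a block of ticks fits in the cache (4-byte ints assumed)."""
    return n_ticks * bytes_per_tick <= cache_bytes


def count_at_or_above(int_prices, thresholds, block_ticks=BLOCK_TICKS):
    """For each threshold (a stand-in for a system combination), count ticks >= it.

    Outer loop: blocks of ticks. Inner loop: all combinations over that block,
    so each block is reused many times before the next one is loaded.
    """
    counts = [0] * len(thresholds)
    for start in range(0, len(int_prices), block_ticks):
        block = int_prices[start:start + block_ticks]
        for k, threshold in enumerate(thresholds):
            counts[k] += sum(1 for p in block if p >= threshold)
    return counts
```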