Why use a database?

PocketChange · Dec 9, 2012

More stats:

Splits 31996
Dividends 167156
Symbol Changes 102489
Trading Halts: 67408

Historic Symbols: 102489

hft_boy · Dec 10, 2012

Quote from PocketChange:

We are now able to locate and access any tick of any instrument nearly instantaneously (<10ms). The data is stored multiple times using different optimizations for accelerating performance.

With all due respect -- what is the point of having a database and being able to locate random ticks quickly? First of all, 10ms is not fast -- that's 100 ticks a second. Second of all, most backtesting is done over sequential ticks. That means that you don't really need fast random access, but fast sequential access.

Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right type of tool for most trading type applications.

dreturns · Dec 10, 2012

Quote from hft_boy:

With all due respect -- what is the point of having a database and being able to locate random ticks quickly? First of all, 10ms is not fast -- that's 100 ticks a second. Second of all, most backtesting is done over sequential ticks. That means that you don't really need fast random access, but fast sequential access.

Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right type of tool for most trading type applications.

I would assume some trading systems can get a bit more complex. Then other factors than disk access speed are important. I use Oracle for several reasons. Data Cache: Once data is in memory, a database can be tuned to keep needed data in memory with out having to build custom binaries. Read consistency: Being able to have multiple threads or jobs accessing and changing model state data and not worry about deadlocks and missed updates. PL/SQL: To me a much easier higher level access language than any other. Itâs slower than some, but the functionality and capability of a full blown Oracle DB with its functions, schema, and memory structures is powerful. Client access: Sharing data across various applications / jobs with out worry of custom access code for every client. Advanced Queues: First in & out queues ideal for dumping in large amounts of inserts so the client process does not get stuck waiting for processing. AQs are also good for having multiple threads access and process the same data and replicating across environments. High Availability: Simply very few other solutions offer as many HA options. If youâre building something on your desktop and it has a system board error or disk failure in the middle of the trading day, now what? Go to last nightâs carbonite backup or usb drive and startover? What did you just miss? Can you at least recapture the data feed for historical reasons even if you canât trade cause your sweating bricks rebuilding? Iâm not talking about a large data center applications; this is here at my house. So no, a DB is not for a laptop based C++ file parser, but that doesnât mean a DB is not the right tool for trading applications.

PocketChange · Dec 10, 2012

Locate and load streams of consolidated records for any instrument starting at any point in time in 10ms.

Our data structures reduce the record count to an average of 3% that of the fix messages. Queries are optimized to return result sets of just the actionable events your interested in aka bid/ask changes... correlation triggers etc.

If you currently have a message handler capable of processing 5 million messages / second you can filter out the fluff, skip to any point in time and feed your handler a stream of actionable events. Using the same backtest code/platform your analysis should complete in 3% of the time just with the reduction of data.

Obviously every ones use case is different but these structures were built out to serve our trading requirements: Mostly Pairs and Cross Exchange Arbs... audit executions.

Quote from hft_boy:

With all due respect -- what is the point of having a database and being able to locate random ticks quickly? First of all, 10ms is not fast -- that's 100 ticks a second. Second of all, most backtesting is done over sequential ticks. That means that you don't really need fast random access, but fast sequential access.

Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right type of tool for most trading type applications.

propseeker · Dec 10, 2012

Quote from PocketChange:

Bars25ms

25 ms? so, i'm guessing this was done using nanex as data source? they're the only ones i know who use that arbitrary timeslice. if you're not using them, i'm curious why you chose it.

We build out our bars differently using ask/bid changes as the trigger and not last trade data.

More...

so, what do you record then on the trigger? midpoint? if so, most illiquid things quote garbage a lot of times (around the open especially, and all the time if it's really illiquid), so mids/quotes are near useless. how do you get around this?

anyway, kudos. i'm sure this took a bit of work.

hft_boy · Dec 10, 2012

Quote from PocketChange:

Locate and load streams of consolidated records for any instrument starting at any point in time in 10ms.

Our data structures reduce the record count to an average of 3% that of the fix messages. Queries are optimized to return result sets of just the actionable events your interested in aka bid/ask changes... correlation triggers etc.

I see. Well I take it back. That is ****ing fantastic. I guess our 'use cases' are totally different

.

PocketChange · Dec 10, 2012

Our feed handlers are synced using meinberg NTP servers which also sync with CME NTP servers. Our clocks are synced and adjusted every minute but we still experience ms' of clock drift.

25ms is the best interval we can reliably coordinate and sync time between machines at different colos. There are also a few ms of order matching lag at the exchange 3 - 8 ms on Globex.

Our objective was an accurate consolidated representation of the market trading conditions... specifically ask / bid liquidity and when an instrument spreads out. We always have the option to rerun the message streams but its generally too slow.

We database 3 streams of pricing data for each instrument at 25ms slices along with supplemental calculated info:
Ask, Ask High, AskLow, AskSize
Bid, BidHigh, BidLow,BidSize
Trade, TradeHigh, Tradelow, Volume.

This resolution of data makes little difference at minute intervals with maybe illiquid instruments reporting more bars reflecting market price quotes.

The real power is at the second and subsecond level. ie. at 10:31:17.275 does Ask hit 127.32 before bid 126.92 or where is the price at 3:55pm

We developed a data structure that contains the forward path and back path of price movement inside the 10:31:17.275 record.

Consequently the answer can be derived instantaneously without any further queries or message processing.

Quote from propseeker:

25 ms? so, i'm guessing this was done using nanex as data source? they're the only ones i know who use that arbitrary timeslice. if you're not using them, i'm curious why you chose it.

so, what do you record then on the trigger? midpoint? if so, most illiquid things quote garbage a lot of times (around the open especially, and all the time if it's really illiquid), so mids/quotes are near useless. how do you get around this?

anyway, kudos. i'm sure this took a bit of work.

DeeDeeTwo · Dec 10, 2012

Quote from hft_boy:

Using flat binary files is blazing fast on modern computers. I can 'parse' (i.e. not actually do any computations with the data) 7 GB in about 7 seconds on a 2 GHz CPU. That's about 30 ns a message. So why use databases? They're bulky, slow, and unless you can give me a good reason, really not the right type of tool for most trading type applications.

Custom flat files are 10 to 100 times faster than databases...
And can be serialized in RAM for even greater speed.

You need an very high level of complexity in your data analysis...
To justify using SQL databases....
Which probably means you are overfitting.

Keep it simple... it's all about execution anyway.

Gizzz · Dec 10, 2012

Quote from DeeDeeTwo:

Custom flat files are 10 to 100 times faster than databases...
And can be serialized in RAM for even greater speed.

You need an very high level of complexity in your data analysis...
To justify using SQL databases....
Which probably means you are overfitting.

Keep it simple... it's all about execution anyway.

Agreed, I have a good experience with binary files on SSD, I use them like arrays.

seek(a*x + b*y + 0)
b1=read.decimal
seek(a*x + b*y + 1)
b2 = read.decimal

and hey, there is my bar

abattia · Dec 12, 2012

Quote from inflector:

I've got an eclectic background. Started programming in high school over 20 years ago writing futures trading systems. Had a bit of fame in my early twenties as a trader and then left for 15 years to start a few software companies.

One of them sold an embedded database which was the number one product on the Macintosh. I worked on the internals, disk access, etc. as well as the query optimization.

There is a huge difference in read time between a database and a binary file unless the database has been specifically optimized for large binary data storage (known as BLOBs in the business).

The reason is simple, even in a database with an efficient caching mechanism large data sets generally involve multiple reads from the disk because the data is split up into chuncks. Every separate read will take a while because on average it will require 1/2 of a rotation of the disk before the data comes under the read heads so the read can start.

Unlike almost every other aspect of computing, disk speeds have not followed Moore's Law. Disks are maybe 30 to 100 times faster than they were 20 years ago while computers are 10,000 times faster.

Even a 10,000 RPM disk takes 6 milliseconds to rotate. So you only get 167 rotations per second. That's a lot of time when computers are doing billions of instructions per second.

For tick data analysis the speed of reading the data is the determining factor for the speed of testing unless you have very inefficient code or are doing esoteric analysis.

So I suggest storing information about your data in a database but storing the physical data on the disk in raw binary files.

You can get acceptable performance from a database if you know what you are doing, however, you will always pay a performance penalty.

- Curtis

+1 (still good after all these years!)