Tick Database Implementations

dcraig · Jan 25, 2007

I am going to be a bit of a Luddite here and ask the question: are these streaming DBs really that useful? Can somebody provide an example of how they are useful?

As I find it difficult to talk about software technology in the abstract, I will briefly discuss a little application I have written in Java as an illustration.

The application is a real-time market scanner (or alarm raiser if you like). You can set criteria like:

Code:

(SMA10 > SMA50) AND (RSI7 < 30) AND (VOLUME > 1.5 * SMA5 (VOLUME) OR ......

In other words, relational expressions of arbitrary complexity involving any of the supported time series "functions". The expressions can include a mix of time series with different periods (eg 1 min or 5 min bars etc) or even time series with constant volume or constant tick bars. Multiple simultaneous screens are supported.

On each tick every screen is evaluated for the instrument the tick occurred on. The largest universe of stocks I have tried so far is components of the NDX. Performance is very, very good. Half a dozen concurrent screens show almost no CPU utilization on an old Athlon 2800 XP (Socket A !). The components of the SPX should be no problem at all.
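As a rough illustration of the per-tick evaluation described above, a screen can be modelled as a composed boolean predicate over the latest indicator values for one instrument. This is only a sketch: `Snapshot`, `Screens` and the series key `SMA5_VOLUME` are hypothetical names, the indicator values are assumed to be computed elsewhere, and only the visible part of the example expression is used (the elided remainder is left out).

```java
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical sketch: the latest indicator values for one instrument.
final class Snapshot {
    private final Map<String, Double> values;

    Snapshot(Map<String, Double> values) {
        this.values = values;
    }

    // Look up the latest value of a named series, failing loudly if it is missing.
    double get(String name) {
        Double v = values.get(name);
        if (v == null) {
            throw new IllegalArgumentException("unknown series: " + name);
        }
        return v;
    }
}

final class Screens {
    private Screens() {}

    // (SMA10 > SMA50) AND (RSI7 < 30) AND (VOLUME > 1.5 * SMA5(VOLUME))
    static Predicate<Snapshot> example() {
        Predicate<Snapshot> trendUp = s -> s.get("SMA10") > s.get("SMA50");
        Predicate<Snapshot> oversold = s -> s.get("RSI7") < 30.0;
        Predicate<Snapshot> volumeSpike =
                s -> s.get("VOLUME") > 1.5 * s.get("SMA5_VOLUME");
        // Predicate.and short-circuits, so later clauses are skipped once one fails.
        return trendUp.and(oversold).and(volumeSpike);
    }
}
```

On each tick, the scanner would build (or update) the snapshot for the instrument that ticked and call `test` on every active screen.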

Implementation - Historical data is stored in MySQL tables and loaded into arrays at startup. (Arrays of doubles not Doubles). These arrays are encapsulated in TimeSeries objects. Arrays are grown as required. I have tried to avoid locking and it seems that I have been successful. TimeSeries objects have an event notification mechanism. There are basically two time series events - LAST_BAR_CHANGED and BAR_APPENDED. Event listeners listen for these, e.g. an SMA class listens for these events on its input and recalculates the last bar or appends a new bar to itself and notifies its listeners. So an indefinite number of subclasses of TimeSeries can be chained together via event notification.
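The event-chained design described above might look roughly like the following. This is a minimal sketch, not the actual application code: apart from the TimeSeries name and the two event names, every identifier is an assumption, it is single-threaded, and an `Sma` is assumed to be attached before its input receives any bars.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

enum SeriesEvent { LAST_BAR_CHANGED, BAR_APPENDED }

interface SeriesListener {
    void onEvent(TimeSeries source, SeriesEvent event);
}

class TimeSeries {
    private double[] data = new double[16];   // primitive doubles, not Double
    private int size = 0;
    private final List<SeriesListener> listeners = new ArrayList<>();

    void addListener(SeriesListener l) { listeners.add(l); }

    int size() { return size; }

    double get(int i) {
        if (i < 0 || i >= size) throw new IndexOutOfBoundsException("index " + i);
        return data[i];
    }

    // Append a new bar, growing the backing array as required.
    void append(double value) {
        if (size == data.length) data = Arrays.copyOf(data, size * 2);
        data[size++] = value;
        fire(SeriesEvent.BAR_APPENDED);
    }

    // Overwrite the value of the bar that is still forming.
    void updateLast(double value) {
        if (size == 0) throw new IllegalStateException("no bars yet");
        data[size - 1] = value;
        fire(SeriesEvent.LAST_BAR_CHANGED);
    }

    private void fire(SeriesEvent e) {
        for (SeriesListener l : listeners) l.onEvent(this, e);
    }
}

// A simple moving average that listens to its input and is itself a TimeSeries,
// so further series (or a scanner, or a chart) can listen to it in turn.
class Sma extends TimeSeries implements SeriesListener {
    private final int period;

    Sma(TimeSeries input, int period) {
        if (period <= 0) throw new IllegalArgumentException("period must be positive");
        this.period = period;
        input.addListener(this);
    }

    @Override
    public void onEvent(TimeSeries input, SeriesEvent event) {
        double value = average(input);
        if (event == SeriesEvent.BAR_APPENDED) append(value);
        else updateLast(value);
    }

    // Mean of the last `period` bars of the input; NaN until enough bars exist.
    private double average(TimeSeries input) {
        int n = input.size();
        if (n < period) return Double.NaN;
        double sum = 0.0;
        for (int i = n - period; i < n; i++) sum += input.get(i);
        return sum / period;
    }
}
```

Because `Sma` only touches the last `period` elements of a contiguous `double[]`, each recalculation reads a small, adjacent block of memory, which is the locality argument made in the post.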

Other time series listeners include charting and the real time scanner.

It seems to me that there are lots of performance advantages in using in-memory arrays, not the least of which is that the portion of the time series for which the last bar is being calculated should be in processor L2 cache for the duration of the calculation - and loaded into cache with only one memory access. I find it hard to believe that anything else (eg most types of Collections or a streaming database) is going to achieve this.

The problem of course is limited memory. Solution - buy more memory and allocate a big swap partition and let the OS virtual memory system do its job. If we are talking analysis (eg backtesting) this should be ok too, because the sequential nature of access to the arrays should result in fairly orderly paging. Address space is no issue with 64bit CPUs.

So here is the question - what would a streaming database do for me? And how would it match the performance?

Sparohok · Jan 26, 2007

Whenever I see ktmexc20 hyping HDF5 I have to give my own contrary viewpoint. In my own experience, it's not a good solution for financial timeseries databases.

Sure, HDF5 makes a lot of sense in the applications for which it was designed - storing extremely large, multidimensional scientific datasets for collaboration and analysis. As far as I can tell, though, HDF5 makes no sense for storing financial data. I tried it out via the pytables interface, and all it gave me was trouble. It wasn't designed for financial timeseries, and IMHO it doesn't solve any of our problems.

If you're transitioning from a more traditional database to HDF5, sure it's blindingly fast. But other than space allocation it really doesn't do anything that a database should do. What you should really be comparing it to is a filesystem, because that's essentially what you get: a hierarchical namespace and a storage allocator. In my personal experience, marshalling your own binary files in a modern filesystem will be more reliable and faster than HDF5, and probably even easier to use. Whereas, if you really need the properties of a database rather than a filesystem, HDF5 won't solve your problems.
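To make "marshalling your own binary files" concrete, here is one possible shape for it, sketched in Java. The record layout (8-byte timestamp, 8-byte price, 4-byte size) and the `TickFile` class are assumptions for illustration, not a format anyone in this thread has described; a real implementation would keep the stream open rather than reopening the file for every tick.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical fixed-width tick record: timestamp (long), price (double), size (int).
final class TickFile {
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES + Integer.BYTES;

    private TickFile() {}

    // Append one tick to the end of the file (big-endian, as DataOutputStream writes).
    static void append(Path file, long timestampMillis, double price, int size)
            throws IOException {
        try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(
                Files.newOutputStream(file,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND)))) {
            out.writeLong(timestampMillis);
            out.writeDouble(price);
            out.writeInt(size);
        }
    }

    // Number of complete records currently in the file.
    static long count(Path file) throws IOException {
        return Files.size(file) / RECORD_BYTES;
    }

    // Read the price of record `index` by seeking straight to its byte offset.
    static double priceAt(Path file, long index) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
            in.seek(index * RECORD_BYTES + Long.BYTES);
            return in.readDouble();
        }
    }
}
```

With one file per instrument (or per instrument and day), the directory tree supplies the hierarchical namespace and the filesystem supplies the storage allocation.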

For the HDF5 proponents, I have to ask, what does HDF5 actually do for you? What benefit does it have for storing financial data?

Martin

arr999 · Jan 27, 2007

Vhayu has the best technology out there for tick database implementations .... many shops go with them .... especially on the hi-freq arb side of biz

vhayu.com

rosy2 · Jan 27, 2007

Quote from arr999:

Vhayu has the best technology out there for tick database implementations .... many shops go with them .... especially on the hi-freq arb side of biz

vhayu.com

How do you know that many shops go with them, and specifically that the hi-freq arb side of the biz uses this?

nitro · Jan 29, 2007

The LINQ project from MSFT is worth watching closely. Note that it is extensions to C# in version 3.0 (and VB) that make the Object<-->Relational mapping "possible".

What is really interesting is the effect that Haskell (functional programming) has had on the project.

Read more...

http://msdn2.microsoft.com/en-us/netframework/aa904594.aspx

and

http://blogs.msdn.com/aconrad/archive/2007/01/09/the-haskell-road-to-enlightenment.aspx

Looks like the MSFT C# team is being pushed by other MSFT teams to take all the great features in highly regarded languages, and integrate them into the MSFT .Net languages, particularly C#.

nitro

ktmexc20 · Jan 29, 2007

Quote from Sparohok:

Whenever I see ktmexc20 hyping HDF5 I have to give my own contrary viewpoint. In my own experience, it's not a good solution for financial timeseries databases.

Sure, HDF5 makes a lot of sense in the applications for which it was designed - storing extremely large, multidimensional scientific datasets for collaboration and analysis. As far as I can tell, though, HDF5 makes no sense for storing financial data. I tried it out via the pytables interface, and all it gave me was trouble. It wasn't designed for financial timeseries, and IMHO it doesn't solve any of our problems.

If you're transitioning from a more traditional database to HDF5, sure it's blindingly fast. But other than space allocation it really doesn't do anything that a database should do. What you should really be comparing it to is a filesystem, because that's essentially what you get: a hierarchical namespace and a storage allocator. In my personal experience, marshalling your own binary files in a modern filesystem will be more reliable and faster than HDF5, and probably even easier to use. Whereas, if you really need the properties of a database rather than a filesystem, HDF5 won't solve your problems.

For the HDF5 proponents, I have to ask, what does HDF5 actually do for you? What benefit does it have for storing financial data?

Martin

Hi Martin,

How is what we use in financial data any different from that of a scientific data set? Why don't most scientists or researchers utilize relational db models?

I'd argue, merely from my current understanding, that relational db's are most useful in cross-reference applications which I can surely see being needed in a business enterprise-IT environment. Though cross-referencing is surely used for some aspects of financial/scientific purposes, it's not a foundational requirement (as it is with enterprise IT). But, when useful, hdf5 does provide such semantics with flexibility.

I think the bottom line is that SQL (relational db) is simply a structured, higher level interface that accommodates convenience and productivity. But in doing so, as with any other higher level structure, latency is inherent in its abstraction. Hdf5 provides flexibility, though as with any other interface providing flexibility (and speed)... some custom interface construction and implementation is required on the developer's part.

For most any of today's languages, there is a trade-off between flexibility/speed and convenience/productivity.

Sparohok · Jan 29, 2007

Quote from ktmexc20:

How is what we use in financial data any different from that of a scientific data set? Why don't most scientists or researchers utilize relational db models?

For the guys who developed HDF5, relational databases aren't even on their radar screens. They're dealing with vast unstructured multidimensional data sets. The main things they do with their data are visualization and numerical analysis. Relational databases aren't much help there and even if they were, they would be many orders of magnitude too slow.

Financial data falls somewhere in between business datasets and supercomputing datasets. Financial timeseries data are arguably more like scientific data than business data particularly in the analysis requirements (mostly statistics & vector math rather than searching and cross-referencing). But there are two big differences:

1) The unit of analysis for financial data (e.g. a single timeseries) almost always fits easily in RAM on a workstation. The unit of analysis for supercomputing application seldom fits in RAM on a workstation.

2) Our data is generally one dimensional, supercomputing data is generally three or higher dimensions.

Most of the unique functionality of HDF5 is intended to address the above two problems. Since we don't generally face those problems, for us, HDF5 is reduced to an overly complex and baroque filesystem.

Sadly there is one other difference that makes HDF5 particularly ill suited for many financial applications:

3) Financial datasets are often used in streaming, multithreaded, multiuser environments where concurrency, atomicity, and locking are essential. Supercomputing datasets are generally produced and then analyzed sequentially.

HDF5 doesn't provide transactional guarantees for either application data or metadata. It's very easy to destroy the entire dataset unless you are strictly single-threaded and single-user.

Filesystems at least provide transactional metadata, although the application data is your own problem. True databases provide ACID guarantees for both data and metadata, but you pay through the nose in performance.
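One common way to lean on the filesystem's transactional metadata while handling the application data yourself is to write a complete new file under a temporary name and then rename it over the old one. The sketch below assumes a hypothetical `AtomicPublish` helper and that the temporary file and the target live on the same filesystem. `Files.move` with `ATOMIC_MOVE` is a real JDK call, but whether an existing target is replaced is implementation specific, and nothing here forces the data to disk, so durability after a crash is not covered.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: readers see either the old file or the complete new one,
// never a half-written file, because the rename is a single metadata operation.
final class AtomicPublish {
    private AtomicPublish() {}

    static void publish(Path target, byte[] contents) throws IOException {
        Path dir = target.toAbsolutePath().getParent();
        // Create the temp file next to the target so the move stays on one filesystem.
        Path tmp = Files.createTempFile(dir, "publish", ".tmp");
        try {
            Files.write(tmp, contents);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // No-op after a successful move; cleans up if the write or move failed.
            Files.deleteIfExists(tmp);
        }
    }
}
```

This only gives all-or-nothing replacement of a whole file by a single writer. Concurrent writers, partial updates and multi-file consistency are still the application's problem, which is the gap a true database fills.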

Martin

ktmexc20 · Jan 29, 2007

Martin, just to clarify, my comments are never intended to inflame, only to offer debate based on what I understand. Always, with total respect for your and others' points of view.

-kt

Quote from Sparohok:

For the guys who developed HDF5, relational databases aren't even on their radar screens. They're dealing with vast unstructured

How is it unstructured? I would say that the organization of the data is quite structured, as you yourself said, similar to a hierarchical file system model... within the file itself.

multidimensional data sets. The main things they do with their data is visualization and numerical analysis. Relational databases aren't much help there and even if they were, they would be many orders of magnitude too slow.

Right, I think that's my point.

Financial data falls somewhere in between business datasets and supercomputing datasets. Financial timeseries data are arguably more like scientific data than business data particularly in the analysis requirements (mostly statistics & vector math rather than searching and cross-referencing). But there are two big differences:

1) The unit of analysis for financial data (e.g. a single timeseries) almost always fits easily in RAM on a workstation.

I think that's obviously in regard to one's needs. Just for the E-mini alone, I currently have over 7 Gb of tick/scaled data I work with..

The unit of analysis for supercomputing application seldom fits in RAM on a workstation.

2) Our data is generally one dimensional, supercomputing data is generally three or higher dimensions.

This is not true because it's primarily a one dimensional structured data-type (iow, a table array) where that struct could very well in itself be of a rank of tens/hundreds/thousands of dimensions. Once again depending on your analytical needs. Hdf5 is highly optimized for table data.

Most of the unique functionality of HDF5 is intended to address the above two problems. Since we don't generally face those problems, for us, HDF5 is reduced to an overly complex and baroque filesystem.

According to one's needs, that could very well be true. But not for any type of elaborate analysis that I think the gist of this thread is referring to, imo.

Sadly there is one other difference that makes HDF5 particularly ill suited for many financial applications:

3) Financial datasets are often used in streaming, multithreaded, multiuser environments where concurrency, atomicity, and locking are essential. Supercomputing datasets are generally produced and then analyzed sequentially.

I respectfully and totally disagree with this. Who says that "Supercomputing datasets are generally produced and then analyzed sequentially". Sorry, but that's really kind of absurd.

HDF5 doesn't provide transactional guarantees for either application data or metadata. It's very easy to destroy the entire dataset unless you are strictly single-threaded and single-user.

I'm not able to debate this issue with you, but I believe that is wrong as well.

Filesystems at least provide transactional metadata, although the application data is your own problem. True databases provide ACID guarantees for both data and metadata, but you pay through the nose in performance.

I've heard of ACID, but am not familiar with it.

All I can rebut in this regard is:
Why do so many mission-critical applications use HDF5?
For example: Boeing, and the Jet Propulsion Laboratory, etc...

Sparohok · Jan 29, 2007

How is it unstructured? I would say that the organization of the data is quite structured, as you yourself said, similar to a hierarchical file system model... within the file itself.

The type of datasets HDF5 is designed for are likely to have relatively simple internal structure - a few large multidimensional arrays, large tables, etc. Whereas relational databases are used to represent complex internal structure. To put it another way: the schema of a scientific dataset is generally far simpler than the schema of a business dataset.

I found that HDF5 performance degraded seriously when I tried to create on the order of tens of thousands of tables. It would take many minutes to open and close the file. This clearly wasn't what it was designed for. Whereas filesystems and databases can handle that kind of complexity with ease.

I think that's obviously in regard to one's needs. Just for the E-mini alone, I currently have over 7 Gb of tick/scaled data I work with..

I'm sure you have no trouble reducing that to an in-memory representation for analysis. Whereas supercomputing results are often terabyte scale.

Neither the size nor the dimensionality alone is the problem. It is the combination of size and dimensionality that makes it modestly difficult to stride through a dataset -- difficult enough to be worthwhile using a library. Notice how much effort the HDF5 APIs put into slicing high dimensional arrays. That's the sort of problem I would use HDF5 to solve.

This is not true because it's primarily a one dimensional structured data-type (iow, a table array) where that struct could very well in itself be of a rank of tens/hundreds/thousands of dimensions.

That's still linear access. It's still a one dimensional datatype. You access it with a single index.

Even if you make your struct indexable, it's still only a two dimensional array. Any individual cell can be located with at most two quantities.
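The indexing argument can be shown with a small sketch: a table of records stored in one flat array, where any cell is found from a row number and a field number. The `TickTable` class and its three fields are hypothetical, chosen only to illustrate the point.

```java
// Hypothetical table of fixed-width records backed by a single flat array.
final class TickTable {
    static final int TIME = 0;
    static final int PRICE = 1;
    static final int VOLUME = 2;
    private static final int FIELDS = 3;

    private final double[] cells;

    TickTable(int rows) {
        cells = new double[rows * FIELDS];
    }

    // Two quantities (row, field) locate any cell, however many fields a record has.
    void set(int row, int field, double value) {
        cells[row * FIELDS + field] = value;
    }

    double get(int row, int field) {
        return cells[row * FIELDS + field];
    }
}
```

Adding more fields to the record widens each row but does not add an index: access is still `(row, field)`.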

But not for any type of elaborate analysis that I think the gist of this thread is referring to, imo.

How is HDF5 well suited for more elaborate financial analysis? Can you give me an example? Hypothetical is fine.

Who says that "Supercomputing datasets are generally produced and then analyzed sequentially". Sorry, but that's really kind of absurd.

Almost everything NCSA does is batch scheduled. Data collection, computation, and presentation of results are separate phases.

Stock traders generally demand data collection, computation, and presentation of results to occur constantly and simultaneously. This is more like OLTP (online transaction processing) and almost nothing like scientific computation. Transactional databases and filesystems were designed for this sort of workflow, HDF5 was not.

Why do so many mission-critical applications use HDF5?
For example: Boeing, and the Jet Propulsion Laboratory, etc...

I doubt they're using it for online data processing.

Looking here:

http://hdf.ncsa.uiuc.edu/users5.html

Almost everything in that list is traditional batch scientific computation.

Martin

ktmexc20 · Jan 30, 2007

Martin, thanks for your comments.

Here on ET, I advocate the use of hdf5 for several reasons. First, to inform and offer an alternative to relational database models. This is because I treat trading as a science, and therefore I use the most adequate, open source, runtime-efficient database for such data. My understanding, from what I've learned, is that that model/implementation is hdf5.

The emphasis for my code is on run-time efficiency and secondarily on a convenient high level interface. I am not aware of another free, open source database that will top hdf5 or its hierarchical model for storing, analyzing, and utilizing (both off-line and in real-time) financial data. If I can be persuaded as to something better I'd happily invest the time to work with that one instead.

A profiling comparison would satisfy some of our debate. Unfortunately, I'm not aware of any studies, and I can't perform them myself at this time. It would be great if someone were to be able to provide some profiling or other evidence. I think that Nitro mentioned something about a comparison in another thread, but I know that he's also very busy. So I guess, for the time being, there it rests.

Lastly, I wish all much success in whatever model/application they might prefer to use.

-kt