Datastore design recommendations

jackfx · Jan 26, 2022

nooby_mcnoob said:
Memory mapped files are fine, and if you structure your files such that the important stuff doesn't require traversing the files you'll be good. But with every solution there is a "what are you doing with this" requirement. For me, I needed to use it for research (which meant slicing and dicing) as well as real time decisions.

Note that I have since switched to manual trading so what I say should be taken with a grain of salt anyway, though I still use it for research and I can load ticks for thousands of symbols quickly enough for research.

When algo/strategies are changing then use whatever tools to profit. When use SQL then create index to speed up queries. Cloud storage provides SQL but not sure it will automatically create indexes based on queries patterns. Text, Excel, NoSQL, AI/ML, exotic, Cloud all are good when algos/strategies are changing. Any tools to show anomalies are welcomed

jackfx · Jan 26, 2022

Clark Bruno said:
I like to focus more on the single question whether users who use columnar databases store all their symbols of a specific structure in one table with an identifier column or whether they set up different tables. Also eager to hear of completely different approaches as long as it deals with columnar database storage.

Short answer: one symbol one table, one algo/strategies/feature/exotic in one table. So basic tick in one table, and tick with secret sauce in other table, so it occupies [1 + (secret qty)] tables. Should Cloud is smart enough, actually it can see SQL query patterns and build indexes behind so user not aware but amazed by performance. For me personally, it's a just a selling point of Cloud company but behind is still Indexes and flat tables (with proprietary storage mgt). Let me know there's such engine, quite amazed it's free of charge and local storage.
When in research stages, just use whatever fastest tools to produce view/graphs/anomalies. Only focus on Storage vs Query when profit. But unfortunately, algo may not last long or improving (for some) thus require Both Storage and Query as fast. So find Cloud and Smart engine to see your queries pattern BUT your SECRET SAUCE are bring STOLEN by hackers (if the engine in cloud)

nooby_mcnoob · Jan 26, 2022

jackfx said:
Short answer: one symbol one table, one algo/strategies/feature/exotic in one table. So basic tick in one table, and tick with secret sauce in other table, so it occupies [1 + (secret qty)] tables. Should Cloud is smart enough, actually it can see SQL query patterns and build indexes behind so user not aware but amazed by performance. For me personally, it's a just a selling point of Cloud company but behind is still Indexes and flat tables (with proprietary storage mgt). Let me know there's such engine, quite amazed it's free of charge and local storage.
When in research stages, just use whatever fastest tools to produce view/graphs/anomalies. Only focus on Storage vs Query when profit. But unfortunately, algo may not last long or improving (for some) thus require Both Storage and Query as fast. So find Cloud and Smart engine to see your queries pattern BUT your SECRET SAUCE are bring STOLEN by hackers (if the engine in cloud)

I suspect if you tell the SQLite guys that their database isn't as fast as the cloud one because it build indexes in the background based on access patterns, they will implement that functionality. Someone figure out how to make Dr. Hipp jealous that cloud does querying better.

Databento · Jan 26, 2022

Clark Bruno said:
Hello everyone, my first post.

I am an algorithmic trader and have so far stored my temporal sequence data in my own binary file data store. I am in the midst of changing parts of my architecture. I came across Clickhouse, an open source highly performant time series database and have one question, as I have never stored tick based data that I use for back testing in a conventional database:

I could import my currency tick based data for one symbol to experiment with performance optimizations. The schema is simply : timestamp (long), bid (float64) and ask (float 64), altogether three columns in the table. Do you now recommend to create separate tables for each symbol or would you store all symbols of an identical asset class in the same table? One symbol alone is around 370 million records. Adding 30 or more symbols for all my currencies is not a big deal, performance wise. The database can deal with many billions of records. But would this be advisable?

What do you guys do, particularly those who work with columnar databases?

There's a fair bit of misinformation here. We maintain a small, 9 node Clickhouse cluster (40 TB) here at Databento for analytics. We also have experience with kdb. And we have a fair amount of experience storing order book and tick data (7+ PB raw and growing).

Clickhouse will perform very well against other column-oriented databases such as Vertica and kdb when compared on the same single node, because Clickhouse exploits many of the same hardware codesign principles, e.g. SIMD, instruction pipelining, linear memory traversal patterns, inline compression, page-sized optimized index structures etc. It also has the distinct advantage of being free, so you can spend your budget towards clustering more hardware for more performance. For our target writes and queries, our Clickhouse cluster outperforms a kdb setup of same cost.

Some selling points of kdb in this situation would be (a) more flexible architecture if you want to support multiple use cases besides backtesting and real-time capture and (b) more powerful query language. You can solve least squares on server side on a massive design matrix with q in a single line, but you can't do that with Clickhouse.

Before asking if you should store the data in one large table, first ask if you have a good reason to store your data on a column-oriented database. Do you need:
- Flexibility to write any arbitrary queries? e.g. Online queries, summary statistics etc.
- To push compute to the data? As you've pointed out, your workflow could be slower on client side than server side.
- Typically run queries that have smaller qualifying results than source data?

The last criterion is probably most important here in answering your original question. A column-oriented database is a good idea when you have patterns where you filter and/or aggregate on columns, largely because it exploits linear scan on the filtered column and late materialization and deferred tuple construction on the (intermediate) results. This partially answers your question: if you already had a good reason to store the data in a column-oriented database, then you will probably have a good reason to store all of the symbols in 1 single large table to exploit things that it is good at - filtering on the symbol column, range selection on the time column and aggregating on other columns. Backtesting usually fails this criterion, because you're likely just seeking through contiguous data from start to end without any qualification.

The other strong reason to store all of your symbols in one large table is an operational one: Adding 1 table each time you add an instrument is not very maintainable. And joining across them is expensive. And prematurely paying higher maintenance cost before you've found a repetitive use case is usually a bad idea.

You may also find these popular posts by another one of our engineers useful:
- What's the most efficient way to store options and time series data for backtesting?
- Alternatives to RDBMS for options backtesting

Clark Bruno · Jan 26, 2022

Best post in this thread so far. And I agree, most posts spout misinformation or are perhaps just a reflection of the lack of knowledge how to deal with large datasets. I have outlined my reasons why I need a columnar data store and the type of queries I run.

But in the end of the day the data has to be stored and queried somehow, so the most important question that needs to be answered first is what is the RELATIVELY best and most optimal way to store such data. Text files are completely out for logical and simple reasons. They are just not feasible for any queries. They don't even perform well for back testing over large amounts of data, streaming and parsing from strings is just way too slow to achieve an ingest rate of millions of data points per second that are required to run large backtests within a reasonable amount of time (CH reads entire partitions and does so multi threaded which still allows me to stream the data at over 5 million ticks per second)

Memory mapping is from my experience just a misconceived concept. Either the data is on file or in memory. You can't have the benefits of either at the same time.

Relational databases are not designed for the queries I intend to run and do not provide the performance necessary for temporal sequence data.

This leaves me with binary flat file storage which I have used for years or with a columnar database.

In the end I decided to switch over to a columnar database because the particular one (CH) already is widely adopted enough that it provides APIs for the languages I use most often. With binary flat files the serialization/deserialization mechanisms from one language to another are different, endianess is an issue, the way the data is parsed to arrays or pocos is different. I just found it way easier to go with a well developed open source tsdb than continuing to go down the road of binary flat files.

Your sharing added some valuable insights. Thanks for that. I was doubtful that anyone would set up thousands of tables for the entire equity universe in the US alone. It makes no sense and cross join queries would kill all performance.

Databento said:
There's a fair bit of misinformation here. We maintain a small, 9 node Clickhouse cluster (40 TB) here at Databento for analytics. We also have experience with kdb. And we have a fair amount of experience storing order book and tick data (7+ PB raw and growing).

Clickhouse will perform very well against other column-oriented databases such as Vertica and kdb when compared on the same single node, because Clickhouse exploits many of the same hardware codesign principles, e.g. SIMD, instruction pipelining, linear memory traversal patterns, inline compression, page-sized optimized index structures etc. It also has the distinct advantage of being free, so you can spend your budget towards clustering more hardware for more performance. For our target writes and queries, our Clickhouse cluster outperforms a kdb setup of same cost.

Some selling points of kdb in this situation would be (a) more flexible architecture if you want to support multiple use cases besides backtesting and real-time capture and (b) more powerful query language. You can solve least squares on server side on a massive design matrix with q in a single line, but you can't do that with Clickhouse.

Before asking if you should store the data in one large table, first ask if you have a good reason to store your data on a column-oriented database. Do you need:
- Flexibility to write any arbitrary queries? e.g. Online queries, summary statistics etc.
- To push compute to the data? As you've pointed out, your workflow could be slower on client side than server side.
- Typically run queries that have smaller qualifying results than source data?

The last criterion is probably most important here in answering your original question. A column-oriented database is a good idea when you have patterns where you filter and/or aggregate on columns, largely because it exploits linear scan on the filtered column and late materialization and deferred tuple construction on the (intermediate) results. This partially answers your question: if you already had a good reason to store the data in a column-oriented database, then you will probably have a good reason to store all of the symbols in 1 single large table to exploit things that it is good at - filtering on the symbol column, range selection on the time column and aggregating on other columns. Backtesting usually fails this criterion, because you're likely just seeking through contiguous data from start to end without any qualification.

The other strong reason to store all of your symbols in one large table is an operational one: Adding 1 table each time you add an instrument is not very maintainable. And joining across them is expensive. And prematurely paying higher maintenance cost before you've found a repetitive use case is usually a bad idea.

You may also find these popular posts by another one of our engineers useful:
- What's the most efficient way to store options and time series data for backtesting?
- Alternatives to RDBMS for options backtesting

MrMuppet · Jan 26, 2022

Databento said:
There's a fair bit of misinformation here. We maintain a small, 9 node Clickhouse cluster (40 TB) here at Databento for analytics. We also have experience with kdb. And we have a fair amount of experience storing order book and tick data (7+ PB raw and growing)....

A sponsor that actually provides valuable information instead of just shilling. Thumbs up

Databento · Jan 26, 2022

No problem.

Clark Bruno said:
In the end I decided to switch over to a columnar database because the particular one (CH) already is widely adopted enough that it provides APIs for the languages I use most often. With binary flat files the serialization/deserialization mechanisms from one language to another are different, endianess is an issue, the way the data is parsed to arrays or pocos is different. I just found it way easier to go with a well developed open source tsdb than continuing to go down the road of binary flat files.

Well, these days we see most of the top trading firms using a binary flat file format on POSIX storage for their backtesting workflows, and scale their IO with a distributed file system like Weka, VAST or Lustre. But as you've pointed out, that takes quite a bit of work. You could consider parquet for the position that you're in.

But otherwise - to shill a little - we are open-sourcing our binary flat file format along with our Python client library and documentation in about 1 month's time (and C++ library later) that will do a fair bit of that work for you. It is quite identical to the internal formats used at top prop firms. Tested briefly with 4,010,830 entries of MBO data on my 2020 MacBook Air:

- Writes at 1.3 GB/s.
- Reads at 3.5 GB/s (200+ million events per second).
- Takes up 71 MB on disk. Compared to 81 MB (HDF5, gzip, complevel=7), 223 MB (HDF5, gzip, complevel=0), 97 MB (parquet, snappy).
- Peak memory use at 192.7 MB. Compared to 321.5 MB for pandas/pyarrow.
- We include much of the serialization/deserialization for you in the client libraries. In CPython:

- Takes about 6.436 seconds to convert to pandas dataframe.
- Also converts to list type about as fast.
- Full order book replay at 154k events per second dispatching a pass-through callback. (And much faster in our internal C libraries and on our storage cluster.)

- Same data structure used in our real-time and historical data services, which also transcode from binary to CSV or JSON (and parquet later) on server side. And we do have coverage of FX order book and tick data.

Clark Bruno · Jan 26, 2022

Those are quite impressive numbers. Though from my experience most back test engines become the bottleneck at around 10mil data points per second when the data feed could actually stream the data a lot faster. The more complex the risk checks and actually algorithm complexity the more this weighs down on performance beyond 10mil/sec. Though hft algorithms are fairly straight forward in nature so I would not be surprised if their testing harnesses performed better. I don't play in that space... so, hard for me to judge.

Databento said:
No problem.

Well, these days we see most of the top trading firms using a binary flat file format on POSIX storage for their backtesting workflows, and scale their IO with a distributed file system like Weka, VAST or Lustre. But as you've pointed out, that takes quite a bit of work. You could consider parquet for the position that you're in.

But otherwise - to shill a little - we are open-sourcing our binary flat file format along with our Python client library and documentation in about 1 month's time (and C++ library later) that will do a fair bit of that work for you. It is quite identical to the internal formats used at top prop firms. Tested briefly with 4,010,830 entries of MBO data on my 2020 MacBook Air:

- Writes at 1.3 GB/s.
- Reads at 3.5 GB/s (200+ million events per second).
- Takes up 71 MB on disk. Compared to 81 MB (HDF5, gzip, complevel=7), 223 MB (HDF5, gzip, complevel=0), 97 MB (parquet, snappy).
- Peak memory use at 192.7 MB. Compared to 321.5 MB for pandas/pyarrow.
- We include much of the serialization/deserialization for you in the client libraries. In CPython:

- Takes about 6.436 seconds to convert to pandas dataframe.
- Also converts to list type about as fast.
- Full order book replay at 154k events per second dispatching a pass-through callback. (And much faster in our internal C libraries and on our storage cluster.)
- Same data structure used in our real-time and historical data services, which also transcode from binary to CSV or JSON (and parquet later) on server side. And we do have coverage of FX order book and tick data.

Clark Bruno · Jan 26, 2022

Would you please stop to pollute my thread with your completely unrelated and random comments?

jackfx said:
Thanks for info, very helpful to cut development time, and endless cycles of optimization, this is selling point of Cloud

Single large table is good idea when engine is highly optimized for specific functions, usually proprietary storage mgt.

Questions:
a. Just in case, when company close down, is it possible to save data to local storage? Prevent to be locked in
b. How to handle Hackers to steal my secret sauces ? All queries to Cloud exposes some degrees of algorithm, sure we make mistakes.

nooby_mcnoob · Jan 26, 2022

Databento said:
No problem.

Well, these days we see most of the top trading firms using a binary flat file format on POSIX storage for their backtesting workflows, and scale their IO with a distributed file system like Weka, VAST or Lustre. But as you've pointed out, that takes quite a bit of work. You could consider parquet for the position that you're in.

But otherwise - to shill a little - we are open-sourcing our binary flat file format along with our Python client library and documentation in about 1 month's time (and C++ library later) that will do a fair bit of that work for you. It is quite identical to the internal formats used at top prop firms. Tested briefly with 4,010,830 entries of MBO data on my 2020 MacBook Air:

- Writes at 1.3 GB/s.
- Reads at 3.5 GB/s (200+ million events per second).
- Takes up 71 MB on disk. Compared to 81 MB (HDF5, gzip, complevel=7), 223 MB (HDF5, gzip, complevel=0), 97 MB (parquet, snappy).
- Peak memory use at 192.7 MB. Compared to 321.5 MB for pandas/pyarrow.
- We include much of the serialization/deserialization for you in the client libraries. In CPython:

- Takes about 6.436 seconds to convert to pandas dataframe.
- Also converts to list type about as fast.
- Full order book replay at 154k events per second dispatching a pass-through callback. (And much faster in our internal C libraries and on our storage cluster.)
- Same data structure used in our real-time and historical data services, which also transcode from binary to CSV or JSON (and parquet later) on server side. And we do have coverage of FX order book and tick data.

That's pretty sick.

Datastore design recommendations

jackfx

jackfx

nooby_mcnoob

Databento

Sponsor

Clark Bruno

MrMuppet

Databento

Sponsor

Clark Bruno

Clark Bruno

nooby_mcnoob