Datastore design recommendations

@Databento, question for you guys: do you offer access to either the raw or curated twitter feeds via api? I am currently tinkering with the gpt-4 api and see a lot of potential to use gpt-4 to speed up interpretation and analysis of fed and corporate news releases which I might want to base several trading algorithms on. Twitter itself does not seem to offer full access to the entire raw feed to end users.
 
Update to my previous post:

I now run all my ai related workflow on a wsl2 instance using Ubuntu on a Windows box. The clickhouse server runs on Ubuntu in wsl2 too. (neat thing is that I could configure the storage of the clickhouse server databases and tables on a separate nvme based raid array - - > extremely fast parallel data retrieval). Newer tensorflow versions don't support GPUs on a Windows machine anymore. Deployed deep learning based models all run on several proximity located Linux based dedicated servers.

Nice, I'm glad it worked out for you, and thanks for the update.

@Databento, question for you guys: do you offer access to either the raw or curated twitter feeds via api? I am currently tinkering with the gpt-4 api and see a lot of potential to use gpt-4 to speed up interpretation and analysis of fed and corporate news releases which I might want to base several trading algorithms on. Twitter itself does not seem to offer full access to the entire raw feed to end users.

Unfortunately, we don't.

Fed releases and other macro releases are difficult because there are premium feeds like AlphaFlash specifically optimized to monetize those events and deliver them from the embargo facility, e.g. K St in DC, to the trading PoPs in the least amount of time. This sort of long-haul wireless connectivity is not our expertise. And the companies who subscribe to these do have sophisticated NLP modeling capability and the ability to put on size.

There's probably a lot you could try with EDGAR filings though. That's on our roadmap, but you can honestly scrape EDGAR with open source tools pretty easily. We know several customers using Ravenpack for something like this.
 
Got it thanks for the prompt reply.

Nice, I'm glad it worked out for you, and thanks for the update.



Unfortunately, we don't.

Fed releases and other macro releases are difficult because there are premium feeds like AlphaFlash specifically optimized to monetize those events and deliver them from the embargo facility, e.g. K St in DC, to the trading PoPs in the least amount of time. This sort of long-haul wireless connectivity is not our expertise. And the companies who subscribe to these do have sophisticated NLP modeling capability and the ability to put on size.

There's probably a lot you could try with EDGAR filings though. That's on our roadmap, but you can honestly scrape EDGAR with open source tools pretty easily. We know several customers using Ravenpack for something like this.
 
Hello everyone, my first post.

I am an algorithmic trader and have so far stored my temporal sequence data in my own binary file data store. I am in the midst of changing parts of my architecture. I came across Clickhouse, an open source highly performant time series database and have one question, as I have never stored tick based data that I use for back testing in a conventional database:

I could import my currency tick based data for one symbol to experiment with performance optimizations. The schema is simply : timestamp (long), bid (float64) and ask (float 64), altogether three columns in the table. Do you now recommend to create separate tables for each symbol or would you store all symbols of an identical asset class in the same table? One symbol alone is around 370 million records. Adding 30 or more symbols for all my currencies is not a big deal, performance wise. The database can deal with many billions of records. But would this be advisable?

What do you guys do, particularly those who work with columnar databases?

@Clark Bruno, thank you so much for starting up such an interesting thread.

Would you please update on the Final Schema that you settled with and what other type of infrastructure are you currently using for querying your bid-ask tick data?

And suppose you need to store not just the bid-ask, but also the order book market depth information for atleast 20 levels depth or even the full depth of the order book, then will you still use the same schema or if you will need to modify the schema in such a case?

I was looking for the efficient solution for storing and analyzing many TB worth of order book data for many different symbols into a single table and then querying it to calculate a few additional columns of aggregated data to be used in further analysis and visualization work and I stumbled upon your thread, which has been very informative. So I am eager to know about your final approach in this regards.

Thanks and regards
 
Hello, I re-read your post and came across your point of wanting to segregate storage and compute. I am currently working on some design changes in my entire data acquisition, Analytics, research, and algo development pipeline.

Have you ever come across or used timebase? It integrates quite neatly with clickhouse and takes care of the management of data (such as order book buildup) post data acquisition while it out sources storage to clickhouse (or other DBs). Its open source and I wonder what your take on timebase is.

I am in particular looking for a solution that can perform both, the management of Realtime multi symbol data and the replay of stored timeseries (ideally, rebuilding the datastream in chronological order across different symbols)....

There's one more point worth mentioning.

Most naive binary flat file implementations are record-oriented, which makes it harder for a generic compression algorithm to squeeze out a high compression ratio. It take significantly more work to write a binary flat file design that employs column-oriented layout.

Clickhouse is fairly efficient at compression. In our own internal testing, a 7 GB CSV ends up being 6.7 GB in MySQL and 670 MB in Clickhouse.

The reduced storage requirements from the additional compression could be enough to pay for the marginal cost to switching to an all-NVMe setup on the same hardware, hosting and power budget.

On the flip side, we don't put all of our data on Clickhouse for various reasons that start to manifest when you run larger clusters like us. One is that we like to decouple compute from storage and scale them independently.
 
Hello, I re-read your post and came across your point of wanting to segregate storage and compute. I am currently working on some design changes in my entire data acquisition, Analytics, research, and algo development pipeline.

Have you ever come across or used timebase? It integrates quite neatly with clickhouse and takes care of the management of data (such as order book buildup) post data acquisition while it out sources storage to clickhouse (or other DBs). Its open source and I wonder what your take on timebase is.

I am in particular looking for a solution that can perform both, the management of Realtime multi symbol data and the replay of stored timeseries (ideally, rebuilding the datastream in chronological order across different symbols)....

We've not used TimeBase but it looks like it's a proprietary software by Deltix. Deltix itself is a reasonably good company, but we don't have data points beyond that.

We just use our own open-source file format (Databento Binary Encoding) for exactly the purpose you've described—it serves as an efficient multi-symbol, time-ordered file storage format while also acting as a message format for real-time data.
 
Was recently looking into this and while ClickHouse seemed decent, ended up going with DuckDB for persistency and Polars for in-memory operations. In recent tests DuckDB has been showing better numbers than CH but that's not all that matters, I find CH harder to work with and I like that DDB is single-file instead of client/server.
 
Was recently looking into this and while ClickHouse seemed decent, ended up going with DuckDB for persistency and Polars for in-memory operations. In recent tests DuckDB has been showing better numbers than CH but that's not all that matters, I find CH harder to work with and I like that DDB is single-file instead of client/server.

Interesting, thanks for sharing.

It seems the single-file limitation is the primary trade-off here. In our case we need horizontal scale-out, but a single server/file is plenty for most users and use cases.

Polars is nice. We're one of their official sponsors. ;)
 
What tests did you run that made you feel like duckdb is faster than ch? I did not get the same results at all but it heavily depends on what you test. I used the NY taxi dB to run analytics and ch was always faster than duckdb sometimes orders of magnitude faster. Also, for data streams as well as reads and writes of large timeseries did ch perform better. But I did spin up multiple clients, multithreaded. It scales exceptionally well.

Another advantage is that ch interfaces with many visualization tools, such as grafana. Not sure duckdb does so, too.

I managed to use ch as backend storage for timebase. I now have one single solution across the board that handles live data streams, off which I can run trading algorithms and/or persist the incoming data to disk/db,and I can use the very same platform to request timeseries from ch that are streamed through timebase for backtests. I migrate to an integrated platform that allows me to use the exact same technology whether I profile new ideas, test ideas or implement and trade algorithms, all on the same architecture. Plus I can do so either via Python or C++ or C# without changing a thing in terms of data storage and event processing. Plus I can quickly load any filtered ticks/bars/whatever into Polars dfs. No more wasted time with static and pricing data. Plus timebase comes from a highly respected house and specializes in highly performant financial trading and risk management solutions. Their code generator saves a lot of time creating pocos and other boiler plate code.

That makes me strongly favor timebase/ch over other solutions in open source space. Time will tell whether this particular solution makes it eventually into my production architecture but I am quite optimistic atm.

Was recently looking into this and while ClickHouse seemed decent, ended up going with DuckDB for persistency and Polars for in-memory operations. In recent tests DuckDB has been showing better numbers than CH but that's not all that matters, I find CH harder to work with and I like that DDB is single-file instead of client/server.
 
Last edited:
What tests did you run that made you feel like duckdb is faster than ch? I did not get the same results at all but it heavily depends on what you test. I used the NY taxi dB to run analytics and ch was always faster than duckdb sometimes orders of magnitude faster. Also, for data streams as well as reads and writes of large timeseries did ch perform better. But I did spin up multiple clients, multithreaded. It scales exceptionally well.

Another advantage is that ch interfaces with many visualization tools, such as grafana. Not sure duckdb does so, too.

I managed to use ch as backend storage for timebase. I now have one single solution across the board that handles live data streams, off which I can run trading algorithms and/or persist the incoming data to disk/db,and I can use the very same platform to request timeseries from ch that are streamed through timebase for backtests. I migrate to an integrated platform that allows me to use the exact same technology whether I profile new ideas, test ideas or implement and trade algorithms, all on the same architecture. Plus I can do so either via Python or C++ or C# without changing a thing in terms of data storage and event processing. Plus I can quickly load any filtered ticks/bars/whatever into Polars dfs. No more wasted time with static and pricing data. Plus timebase comes from a highly respected house and specializes in highly performant financial trading and risk management solutions. Their code generator saves a lot of time creating pocos and other boiler plate code.

That makes me strongly favor timebase/ch over other solutions in open source space. Time will tell whether this particular solution makes it eventually into my production architecture but I am quite optimistic atm.

https://duckdb.org/2023/04/14/h2oai.html

You can question the objectivity but it's a standardised test.
DuckDB used to be much slower but it has changed in the last year or so. Comparing with the same test from 2021. TimeBase seems proprietary and for that reason I will avoid it like the plague. I might choose to go commercial with what I'm working on, so locking myself into a proprietary platform is a no-go.

I will only need to interface using Python, backends of the packages I use are in C, C++, Rust so there's not much of a speed issue.

Another tool to keep in mind is Ibis. They intend to make it so working with any backend db system is standardised. So whether it's DuckDB or ClickHouse, wouldn't mean changing any part of the code outside of spinning up the server / library imports.
 
Back
Top