There's a fair bit of misinformation here. We maintain a small, 9 node Clickhouse cluster (40 TB) here at Databento for analytics. We also have experience with kdb. And we have a fair amount of experience storing order book and tick data (7+ PB raw and growing).
Clickhouse will perform very well against other column-oriented databases such as Vertica and kdb when compared on the same single node, because Clickhouse exploits many of the same hardware codesign principles, e.g. SIMD, instruction pipelining, linear memory traversal patterns, inline compression, page-sized optimized index structures etc. It also has the distinct advantage of being free, so you can spend your budget towards clustering more hardware for more performance. For our target writes and queries, our Clickhouse cluster outperforms a kdb setup of same cost.
Some selling points of kdb in this situation would be (a) more flexible architecture if you want to support multiple use cases
besides backtesting and real-time capture and (b) more powerful query language. You can solve least squares on server side on a massive design matrix with q in a single line, but you can't do that with Clickhouse.
Before asking if you should store the data in one large table, first ask if you have a good reason to store your data on a column-oriented database. Do you need:
-
Flexibility to write any arbitrary queries? e.g. Online queries, summary statistics etc.
-
To push compute to the data? As you've pointed out, your workflow could be slower on client side than server side.
-
Typically run queries that have smaller qualifying results than source data?
The last criterion is probably most important here in answering your original question. A column-oriented database is a good idea when you have patterns where you filter and/or aggregate on columns, largely because it exploits linear scan on the filtered column and late materialization and deferred tuple construction on the (intermediate) results. This partially answers your question:
if you already had a good reason to store the data in a column-oriented database, then you will probably have a good reason to store all of the symbols in 1 single large table to exploit things that it is good at - filtering on the symbol column, range selection on the time column and aggregating on other columns. Backtesting usually fails this criterion, because you're likely just seeking through contiguous data from start to end without any qualification.
The other strong reason to store all of your symbols in one large table is an operational one: Adding 1 table each time you add an instrument is not very maintainable. And joining across them is expensive. And prematurely paying higher maintenance cost before you've found a repetitive use case is usually a bad idea.
You may also find these popular posts by another one of our engineers useful:
-
What's the most efficient way to store options and time series data for backtesting?
-
Alternatives to RDBMS for options backtesting