Tick data storage

Lately I have been optimizing data storage, and it appears that time series data holds enormous potential for custom compression methods that deliver very small formats while still retaining decent decompression speed.
Initially I was using lz4, but it appears not to be the best choice for time series data, although a modified version of lz4 named delta4c is claimed to achieve more.
But neither the product nor the source has been released yet, and it is a work in progress: https://blog.quasardb.net/introduci...d-adaptive-lossless-compressor-for-timeseries
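
To illustrate why a delta step in front of a general-purpose compressor can help on time series, here is a minimal sketch. It is not delta4c's actual algorithm (that is not public as far as I know); zlib stands in for lz4 so the example runs on the standard library alone, and the timestamps are synthetic.

```python
import struct
import zlib

def delta_encode(values):
    """Replace each value with its difference from the previous one."""
    prev = 0
    deltas = []
    for v in values:
        deltas.append(v - prev)
        prev = v
    return deltas

def delta_decode(deltas):
    """Invert delta_encode with a running sum."""
    total = 0
    values = []
    for d in deltas:
        total += d
        values.append(total)
    return values

def pack(values):
    """Serialize integers as little-endian signed 64-bit values."""
    return struct.pack(f"<{len(values)}q", *values)

# Synthetic millisecond timestamps with a slightly irregular step (assumption).
timestamps = []
t = 1_600_000_000_000
for i in range(10_000):
    t += 250 + (i % 7)
    timestamps.append(t)

plain_size = len(zlib.compress(pack(timestamps)))
delta_size = len(zlib.compress(pack(delta_encode(timestamps))))
print(f"plain: {plain_size} bytes, delta first: {delta_size} bytes")
```

The deltas are small, repetitive numbers, so the byte stream handed to the compressor contains far more repetition than the raw timestamps do.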

With more complex compression algos optimized specifically for time series, it is possible to obtain roughly 94-98% size reduction on common market data, i.e. 100 TB of data shrinks to 2-6 TB.
But if a disk failure corrupts the data, it is not as restorable as with simpler methods, so I would choose a more easily restorable format.
Let's say that if 1% of a file gets corrupted, losing 2% of the data after restoration is acceptable.
Also, the best (compression ratio/decompression speed) trade-off might be too hard to achieve with higher-complexity algos offering high compression ratios.
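
One way to keep a compressed file restorable is to compress it in independent blocks, each with its own header and checksum, so a corrupted region only costs the blocks it touches. A rough sketch of that idea follows; the block size, the `TBLK` magic marker and the header layout are made up for illustration, and zlib is just a stand-in codec.

```python
import struct
import zlib

BLOCK_SIZE = 64 * 1024           # hypothetical block size
MAGIC = b"TBLK"                  # hypothetical marker used to resynchronize
HEADER = struct.Struct("<4sII")  # magic, payload length, CRC32 of payload

def pack_blocks(data, block_size=BLOCK_SIZE):
    """Compress data as a sequence of independent, checksummed blocks."""
    out = bytearray()
    for start in range(0, len(data), block_size):
        payload = zlib.compress(data[start:start + block_size])
        out += HEADER.pack(MAGIC, len(payload), zlib.crc32(payload))
        out += payload
    return bytes(out)

def recover_blocks(blob):
    """Return every block that still passes its checksum, skipping damaged ones."""
    good = []
    pos = 0
    while True:
        pos = blob.find(MAGIC, pos)
        if pos < 0 or pos + HEADER.size > len(blob):
            break
        _, length, crc = HEADER.unpack_from(blob, pos)
        begin = pos + HEADER.size
        payload = blob[begin:begin + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            good.append(zlib.decompress(payload))
            pos = begin + length
        else:
            pos += 1  # damaged block: scan forward for the next marker
    return good
```

With smaller blocks, less data is lost per corrupted region, but the compression ratio drops because each block starts with no history.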


My goal is to find the best lossless compression algo in terms of the (compression ratio/decompression speed) trade-off to compare against my current implementations.
Surely some HFT firm or data-related firm has better formats, but those would not be obtainable on the web.
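
For comparing candidates, a small harness along these lines could measure both numbers on the same sample. This is only a sketch: the standard-library codecs are placeholders for whatever algos are being evaluated, and the sample data is synthetic, so the printed numbers say nothing about real tick data.

```python
import bz2
import lzma
import time
import zlib

def benchmark(name, compress, decompress, data, repeats=3):
    """Measure compression ratio and best-of-N decompression throughput."""
    packed = compress(data)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        restored = decompress(packed)
        best = min(best, time.perf_counter() - start)
    if restored != data:
        raise ValueError(f"{name} round trip is not lossless")
    return {
        "codec": name,
        "ratio": len(data) / len(packed),
        "decomp_mb_per_s": len(data) / max(best, 1e-9) / 1e6,
    }

# Synthetic text-like tick lines; swap in a real sample for meaningful numbers.
sample = "".join(
    f"{1_600_000_000 + i},{100 + (i % 50) / 100:.2f}\n" for i in range(50_000)
).encode()

results = [
    benchmark("zlib", zlib.compress, zlib.decompress, sample),
    benchmark("bz2", bz2.compress, bz2.decompress, sample),
    benchmark("lzma", lzma.compress, lzma.decompress, sample),
]
for row in sorted(results, key=lambda r: r["decomp_mb_per_s"], reverse=True):
    print(f'{row["codec"]:5s} ratio={row["ratio"]:.1f} '
          f'decompress={row["decomp_mb_per_s"]:.0f} MB/s')
```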

The algo should be usable both on files and for in-memory compression.

What are the best compression algos that specifically target timeseries data and high decompression speed?
 
To add more context, the algo will be used in a custom database that buffers blocks of data from disk to RAM on some threads while other threads work on the data.
It utilizes both in-memory compression and disk compression; atm both use the same format.
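
The buffering scheme is roughly a producer/consumer pipeline. A simplified sketch with one loader thread and a bounded queue is below; the function names and the queue depth are illustrative only, zlib is a stand-in codec, and the real database is not written like this.

```python
import queue
import threading
import zlib

def loader(compressed_blocks, out_q):
    """Producer: decompress blocks and hand them to the worker side."""
    for block in compressed_blocks:
        out_q.put(zlib.decompress(block))
    out_q.put(None)  # sentinel: no more blocks

def process(compressed_blocks, depth=4):
    """Consumer: work on decompressed blocks while the loader stays ahead."""
    q = queue.Queue(maxsize=depth)  # bounded, so the loader cannot fill RAM
    t = threading.Thread(target=loader, args=(compressed_blocks, q), daemon=True)
    t.start()
    total = 0
    while (block := q.get()) is not None:
        total += len(block)  # placeholder for the real work on the data
    t.join()
    return total
```

The bounded queue is what keeps memory use fixed: the loader blocks once `depth` decompressed blocks are waiting.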

From my initial findings, it appears that with some relatively simple compression configurations it is possible to load time series files of less than half the size at more than double the effective disk speed compared to regular formats, while keeping the compression format middle out and the decompression operations extremely cheap.

Forgot to mention: the buffered data is floating point, and the data initially loaded from disk is text based and compressed to <0.5 bytes per character.

The current plan is to use high-capacity M.2 drives rated at 5000 MB/s sequential read and see whether the CPU bottlenecks or the effective disk speed doubles.
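
The expectation behind that test can be written as a back-of-the-envelope model, assuming disk reads and decompression overlap perfectly. The decompressor speeds below are hypothetical inputs, not measurements.

```python
def effective_read_speed(disk_mb_s, compression_ratio, decompress_mb_s):
    """Uncompressed MB/s delivered when reading and decompressing overlap.

    decompress_mb_s is the decompressor's output rate in uncompressed MB/s.
    """
    delivered_by_disk = disk_mb_s * compression_ratio
    return min(delivered_by_disk, decompress_mb_s)

# 5000 MB/s drive, files at half size (ratio 2.0):
print(effective_read_speed(5000, 2.0, 20000))  # decompressor keeps up
print(effective_read_speed(5000, 2.0, 6000))   # CPU is the bottleneck
```

In the first case the drive is the limit and the effective speed doubles; in the second the decompressor caps throughput below what the drive could feed.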
 