Tick data storage

931 · Jun 29, 2020

Lately I have been optimizing data storage, and it appears that time series data holds enormous potential for custom compression methods that deliver very small formats while still retaining decent decompression speed.
Initially I was using lz4, but it appears not to be the best choice for time series data, although a modified version of lz4 named delta4c is claimed to achieve more.
But those products or sources are not yet released and it is a work in progress... https://blog.quasardb.net/introduci...d-adaptive-lossless-compressor-for-timeseries

With more complex compression algos optimized specifically for time series it is possible to obtain roughly 94-98% compression on common market data, i.e. 100 TB of data shrinks to 2-6 TB.
But if a disk failure corrupts data, it is not as restorable as with simpler methods, so I would choose a more easily restorable format.
Let's say 1% of a file gets corrupted; losing 2% after data restoration would be acceptable.
Also, the best (compression ratio/decompression speed) trade-off might be too hard to achieve with the higher complexity algos that offer high compression ratios.
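
To illustrate the restorability point, here is a minimal sketch of a block-framed layout in Python, where each block is compressed independently and carries a checksum, so a corrupted region costs only the blocks it touches. The names `pack_blocks`, `unpack_blocks`, the header layout and the 64 KiB block size are assumptions made up for this example, and zlib merely stands in for whatever codec is actually used.

```python
import struct
import zlib

BLOCK_SIZE = 64 * 1024  # assumed; smaller blocks limit damage but compress worse
# Hypothetical header: compressed length, raw length, CRC32 of the compressed payload
HEADER = struct.Struct("<III")


def pack_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Compress each block independently and prefix it with a small header."""
    out = bytearray()
    for start in range(0, len(data), block_size):
        raw = data[start:start + block_size]
        payload = zlib.compress(raw)
        out += HEADER.pack(len(payload), len(raw), zlib.crc32(payload))
        out += payload
    return bytes(out)


def unpack_blocks(blob: bytes):
    """Return (recovered_bytes, lost_byte_count), skipping blocks whose CRC fails."""
    recovered = bytearray()
    lost = 0
    pos = 0
    while pos + HEADER.size <= len(blob):
        comp_len, raw_len, crc = HEADER.unpack_from(blob, pos)
        pos += HEADER.size
        payload = blob[pos:pos + comp_len]
        pos += comp_len
        if len(payload) == comp_len and zlib.crc32(payload) == crc:
            recovered += zlib.decompress(payload)
        else:
            lost += raw_len  # only this block is lost; later blocks still decode
    return bytes(recovered), lost
```

This sketch assumes the headers themselves survive; if a header is damaged the framing breaks, so a real format would probably also want sync markers or a separate block index.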

My goal is to find the best lossless compression algo in terms of the (compression ratio/decompression speed) trade-off, to compare against my current implementations.
Surely some HFT firm or data-related firm has better formats, but those would not be obtainable on the web.
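
For comparing candidates on those two numbers, a small harness along these lines might help. It is only a sketch: it uses the Python standard library codecs as stand-ins (lz4 or a time-series-specific codec would be plugged into `CODECS` the same way), and `make_sample_ticks` generates synthetic data in a made-up "timestamp,price" layout, not real market data.

```python
import bz2
import lzma
import time
import zlib

# Standard-library codecs only; other codecs need third-party packages (not shown).
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def benchmark(data: bytes, repeats: int = 3) -> dict:
    """Return {codec: (compression_ratio, decompression_MB_per_s)} for `data`."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        packed = compress(data)
        assert decompress(packed) == data  # lossless round trip
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            decompress(packed)
            best = min(best, time.perf_counter() - t0)
        ratio = len(data) / len(packed)
        speed = len(data) / best / 1e6 if best > 0 else float("inf")
        results[name] = (ratio, speed)
    return results


def make_sample_ticks(n: int = 20000) -> bytes:
    """Synthetic text ticks (hypothetical layout), for illustration only."""
    lines = []
    price = 10000
    for i in range(n):
        price += (i * 7919) % 5 - 2  # deterministic small moves
        lines.append(f"{1593388800000 + i * 250},{price / 100:.2f}\n")
    return "".join(lines).encode("ascii")
```

Calling `benchmark(make_sample_ticks())` gives one (ratio, MB/s) pair per codec. Numbers from synthetic data say little about real ticks, so the same call should be repeated on an actual file.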

The algo should be usable both on files and for in-memory compression.

What are the best compression algos that specifically target timeseries data and high decompression speed?

931 · Jun 29, 2020

To add more context: the algo will be used in a custom database that buffers blocks of data from disk to RAM on some threads while other threads work on the data.
It utilizes both in-memory compression and disk compression; at the moment both use the same format.
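
The buffering structure described above can be sketched roughly as follows. This is not the actual database code, only a Python illustration of the pattern: a loader thread decompresses blocks into a bounded queue while the caller's thread works on blocks already loaded. The names `prefetch_blocks`, `process_all` and `max_buffered` are invented for the example, and zlib stands in for the real codec.

```python
import queue
import threading
import zlib


def prefetch_blocks(compressed_blocks, out_queue):
    """Loader thread: decompress blocks and hand them over; None marks the end."""
    for block in compressed_blocks:
        out_queue.put(zlib.decompress(block))  # waits when the buffer is full
    out_queue.put(None)


def process_all(compressed_blocks, work, max_buffered=4):
    """Run `work` on each decompressed block while the loader thread reads ahead."""
    buffered = queue.Queue(maxsize=max_buffered)  # bounds RAM used for read-ahead
    loader = threading.Thread(target=prefetch_blocks, args=(compressed_blocks, buffered))
    loader.start()
    results = []
    while True:
        block = buffered.get()
        if block is None:
            break
        results.append(work(block))
    loader.join()
    return results
```

The bounded queue is the design choice that matters here: it caps how far the loader can run ahead, so memory use stays predictable whichever side is faster.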

From my initial findings it appears that with some relatively simple compression configurations it is possible to load time series files of less than half the size at more than double the disk speed compared to regular formats, while keeping the compression format middle out and the decompression operations extremely cheap.

Forgot to mention: the buffered data is floating point, and the initial disk-loaded data format is text based and compressed at <0.5 bytes per character.

The current plan is to use high capacity M.2 drives rated at 5000 MB/s sequential read and see whether the CPU bottlenecks or the effective read speed doubles relative to the disk speed.
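
The arithmetic behind that test can be written down as a back-of-the-envelope model. It assumes decompression is fully overlapped with disk reads, and `cpu_decompress_mb_s` (a name made up here) is the decompressor's output rate in uncompressed MB/s.

```python
def effective_read_mb_s(disk_mb_s: float, compression_ratio: float,
                        cpu_decompress_mb_s: float) -> float:
    """Uncompressed MB/s delivered to the application under a simple pipeline model.

    The disk delivers disk_mb_s of compressed bytes, which expand to
    disk_mb_s * compression_ratio of usable data, unless the CPU cannot
    decompress that fast, in which case the CPU rate is the limit.
    """
    disk_limited = disk_mb_s * compression_ratio
    return min(disk_limited, cpu_decompress_mb_s)


# A 5000 MB/s drive with 2:1 compression: doubled if the CPU keeps up...
print(effective_read_mb_s(5000, 2.0, 20000))  # 10000.0
# ...but capped by the decompressor if it only manages 6000 MB/s.
print(effective_read_mb_s(5000, 2.0, 6000))   # 6000
```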

globalarbtrader · Jun 29, 2020

Not an answer to your question, but some independent confirmation of your results that compression increases speed as well as obviously disk usage.

https://code.kx.com/q/wp/compress/

GAT

2rosy · Jun 29, 2020

what are you compressing? ascii?
take a look at this it's free and probably better than what you're doing
https://parquet.apache.org/

931 · Jun 29, 2020

2rosy said:
what are you compressing? ascii?
take a look at this it's free and probably better than what you're doing
https://parquet.apache.org/

An ASCII-like, human readable format for long term data storage; floating point for buffering.

globalarbtrader · Jun 29, 2020

931 said:
Ascii like fully readable format for long term data storage, floating for buffering.

I use https://github.com/man-group/arctic but it probably won't suit if you want to keep the ascii format.

GAT

931 · Jun 29, 2020

globalarbtrader said:
Not an answer to your question, but some independent confirmation of your results that compression increases speed as well as obviously disk usage.

https://code.kx.com/q/wp/compress/

GAT

You probably meant reduces usage.

globalarbtrader · Jun 29, 2020

931 said:
You probably meant reduces usage.

Slaps head....

GAT

931 · Jun 29, 2020

globalarbtrader said:
I use https://github.com/man-group/arctic but it probably won't suit if you want to keep the ascii format.

GAT

It's not using an ASCII-based character table, but I have a custom text viewer with a decoder to view it in ASCII, UTF-8, Unicode or whatever Qt apps show text as when a QString gets displayed.

931 · Jun 29, 2020

After reading this article I started to look at various compression algos and got interested.
https://github.com/VictoriaMetrics/VictoriaMetrics
This database software could probably have been written in a lower level language.

More importantly, I'm looking for floating point compression. The text based format is just for long term storage and reliability.
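
As a starting point for the floating point side, here is a minimal lossless sketch of the XOR-with-previous-value idea used by Gorilla-style time series codecs. It is not the format discussed in this thread: the function names are invented, and zlib stands in for the bit-packing stage a real codec would use. Consecutive prices tend to share sign, exponent and high mantissa bits, so the XOR results are mostly zero bytes that compress well.

```python
import struct
import zlib


def compress_floats(values):
    """XOR each float64 bit pattern with the previous one, then zlib the result."""
    prev = 0
    xored = []
    for v in values:
        bits = struct.unpack("<Q", struct.pack("<d", v))[0]
        xored.append(bits ^ prev)  # similar neighbours leave mostly zero bits
        prev = bits
    raw = struct.pack(f"<{len(xored)}Q", *xored)
    return zlib.compress(raw)


def decompress_floats(blob):
    """Invert compress_floats exactly, bit for bit."""
    raw = zlib.decompress(blob)
    count = len(raw) // 8
    prev = 0
    values = []
    for x in struct.unpack(f"<{count}Q", raw):
        prev ^= x  # undo the XOR chain
        values.append(struct.unpack("<d", struct.pack("<Q", prev))[0])
    return values
```

Decompression is one XOR per value plus the zlib pass, which keeps it cheap; how the ratio compares with the text format would have to be measured on real ticks.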