Time series DB?

CME Observer · Dec 25, 2017

sle said:
I guess I have to do some real cost/benefit analysis before I pick a direction. The inputs are as follows:
- very small team, one techy young monkey and one useless old fart that can barely boot up his MacBook; so we can’t support a heavy technology stack
- only a small number of assets require tick data work at this point in time (probably talking a few hundred symbols, only 10-20 used concurrently); so organizational aspects are not crucial, I can probably use a file-system based structure
- at some point, I might move in the direction of doing more of the latency sensitive stuff; so flexibility is important
- main requirements are rapid reading and writing of rather large blocks of data (intraday we dump ticks into a text file) for research and back testing

Is there a reason I don’t want to go with bcolz given the above?

Okay maybe my solution is technically a bit ambitious given that you won't have an unlimited capacity/patience for IT on your team. I think the hosted pub/sub + cluster of subscribers approach is sexy but might be overkill given the above. Certainly untested.

Personally, I can't comment on bcolz but it does seem interesting. I'll also say that given you really have a log replay rather than an ad hoc query requirement, I think the proper abstractions above a file system consisting of binary files of tick data are a viable low headache approach. Obviously you'd want to write an API that abstracts all the headache of text files. I'd say test bcolz though. Worst case scenario is you've probably just made migration to whatever DB you try next easier than it would be with plain text files. If I was in your shoes I'd probably give mariadb some thought as well.
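To make the "API above binary files" idea concrete, here is a minimal sketch in Python using only the standard library. The record layout (int64 nanosecond timestamp, float64 price, uint32 size) and the names Tick, RECORD, append_ticks and replay are assumptions for illustration, not anything specified in this thread.

```python
import struct
from pathlib import Path
from typing import Iterable, Iterator, NamedTuple


class Tick(NamedTuple):
    ts_ns: int    # timestamp, nanoseconds since epoch (assumed convention)
    price: float
    size: int


# Hypothetical fixed-width record: little-endian int64, float64, uint32.
# "<" disables padding, so every record is exactly 20 bytes.
RECORD = struct.Struct("<qdI")


def append_ticks(path: Path, ticks: Iterable[Tick]) -> None:
    """Append ticks to a binary file as fixed-width records."""
    with path.open("ab") as f:
        for t in ticks:
            f.write(RECORD.pack(t.ts_ns, t.price, t.size))


def replay(path: Path) -> Iterator[Tick]:
    """Yield ticks back in the order they were written."""
    with path.open("rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file, or a partial trailing record
            yield Tick(*RECORD.unpack(chunk))
```

Callers only ever see append_ticks and replay, so swapping the storage underneath (bcolz, HDF5, a database) later should not touch the research code. A fixed-width layout also lets a C++ reader map the same file with a matching packed struct.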

2rosy · Dec 25, 2017

given your requirement bcolz works. I used it previously due to memory limits with pandas.

Simples · Dec 25, 2017

In order to do the data querying, processing and output, how many machines will be needed, i.e. will you now need a distributed event processing platform? With enough CPU and parallelism, your network becomes your bottleneck and something like Kafka will be useful for moving/splitting datastreams and even doing realtime processing while handling most of the network side for you. Choosing that route will require some initial investment, but will be able to scale up to what your network can handle, and also provide some flexibility for querying and processing (unlike traditional queues).

If it's 1-3 boxes for the obvious parts, you could get away with the filesystem, especially if it's mainly just used for replay, not complex querying and reordering. Using Linux, you can allow lots of open files. Alas, more concurrency will require more and more seeks in between. Still, a simple approach like this can really scale if you manage the complexity yourself, keep it simple and it's not too much data. For just reading/streaming, in-memory is really overkill, as the network is often the most limiting factor nowadays, not hard drives.

Most start with a DB, especially an RDBMS if they can get their hands on it, but it really limits your design from then on, so choose wisely and don't be in a rush to settle. "It can always be added later" is often the best architectural decision you can make now, because you never know what you'll need in the future anyways. If what you need now screams at you, you don't really need to ask, do you?

sle · Dec 25, 2017

i960 said:
I just saw that you posted 10-20 tick level instruments. How much history are we talking? It might be better to just get some raw numbers out there because any solution you use is going to be massively dependent on storage and *how* you structure the data.

Probably talking 100GB of data at the very very most if I do add the other strategies I am planning to add.

i960 said:
Are you the old fart here?

Yup, unfortunately

i960 said:
Unless you've got something novel, model-wise, don't you think you're probably outgunned from the start here? My mind would be blown if commercial entities in the latency sensitive space are using any interpreted languages whatsoever at runtime (of which python is an interpreted language [and a slow one at that]). You'd have to be calculating something they're just not even remotely interested in in order to win that war - and who knows, that might just be the case. I'd say if you can avoid this entire space for as long as possible it's probably for the better.

It is "novel" in the sense that it's high-effort from a modeling perspective. In any case, when I say "latency-sensitive" I don't mean UHF, I simply mean strategies that hold positions intraday. The run time engine is using C++, but I do use python for testing and research (also, real time model inputs that are not latency sensitive are generated in python too).

i960 said:
Just straight dumping doubles via printf or something along those lines? Also, how many days of intraday data? The text file dump of ticks won't scale for anything highly liquid and encompassing months worth of data. If it's just 24h worth of data then it'll probably work but I don't know how you'd backtest anything with just that.

Well, I dump them into a typical date-organized directory tree, so the biggest files are 24 hours for a single symbol. Since I have such a small number of concurrent symbols, I have a separate process handling recording for each symbol. I don't think we ever had any problems with it being too heavy.
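A date-organized tree with one file per symbol per day can be walked with a few lines of Python. This is only a sketch: the exact layout (root/YYYY/MM/DD/SYMBOL.txt) and the function names tick_path and files_in_range are hypothetical, since the post does not say how the directories are named.

```python
from datetime import date, timedelta
from pathlib import Path
from typing import Iterator


def tick_path(root: Path, symbol: str, day: date) -> Path:
    """Assumed layout: <root>/YYYY/MM/DD/<symbol>.txt (one file per symbol per day)."""
    return root / f"{day:%Y}" / f"{day:%m}" / f"{day:%d}" / f"{symbol}.txt"


def files_in_range(root: Path, symbol: str, start: date, end: date) -> Iterator[Path]:
    """Yield the existing daily files for a symbol in [start, end], oldest first."""
    day = start
    while day <= end:
        path = tick_path(root, symbol, day)
        if path.exists():  # weekends and holidays simply have no file
            yield path
        day += timedelta(days=1)
```

A backtest would then iterate files_in_range(...) and replay each file in turn, which keeps every read sequential and bounded to one day of one symbol.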

i960 said:
bcolz may work fine for you, I'm just skeptical of the scalability of any database implemented on top of an interpreted language because they simply don't perform or scale. They eventually all hit bottlenecks requiring use of underlying components written in native languages in order to keep scaling. Generally they're good for scaffolding or prototyping things but they all eventually hit a wall (with unfortunately too many people throwing hardware at the problem rather than changing the algorithm [in this context, the implementation language]).

Seems like (I have not read it in detail yet) bcolz has most of its logic written in C/Cython, which should be pretty fast. We already are going to throw some hardware at it, my minion is supposed to make a decision if we want to buy our own hyper-box or find a virtual solution.

sle · Dec 25, 2017

CME Observer said:
I'd say test bcolz though. Worst case scenario is you've probably just made migration to whatever DB you try next easier than it would be with plain text files. If I was in your shoes I'd probably give mariadb some thought as well.

Makes sense. The added advantage of figuring out an open source solution is that should I decide to move to another firm (or should it be decided for me, for some reason) I can quickly re-deploy the same technology stack.

Simples said:
If 1-3 boxes for the obvious parts, you could get away with filesystem, especially if mainly just used for replay, not complex querying and reordering. Using Linux, you can allow lots of open files.

So far, given how small the team is and how few resources we are willing to spend on maintaining this bit, we were going with a single box. The hardware is so cheap these days that we can get a machine that fits our entire dataset into memory for under 10k (I think).

Simples said:
With enough CPU and parallelism, your network becomes your bottleneck and something like Kafka will be useful for moving/splitting datastreams and even doing realtime processing while handling most of the network side for you. Choosing that route will require some initial investment, but will be able to scale up to what your network can handle, and also provide some flexibility for querying and processing (unlike traditional queues).

Right. I wonder if that can be added later - at the moment, I just want a simple solution that would be easy to deploy and maintain.

PS. Someone just pointed out that since all of my data sets fit into memory, there are standard formats that pandas supports that are blindingly fast.
PPS. Apparently, either feather or hdf5 formats are my best choices if I want to go down a binary file route. In both cases, no extra support is needed (as it's fully integrated into pandas), plus hdf5 has very good C++ support.

temnik · Dec 25, 2017

It's great to finally find a sane thread on this website...

Personally, I'm using influxdb. It's not ideal - especially compared to something like kdb. But it's good enough for now. You can take a look at http://community.influxdata.com to see what kind of real problems real people have to deal with.

What I don't like about blosc/bcolz off the bat is that it ties you into the python ecosystem. Yes, python is #1 in machine-learning right now, but I like to keep my options open.

sle · Dec 25, 2017

temnik said:
Personally, I'm using influxdb. It's not ideal - especially compared to something like kdb. But it's good enough for now. You can take a look at http://community.influxdata.com to see what kind of real problems real people have to deal with.

It's on my radar, but I don't know if I can support yet another stand-alone product when I can get away with a simpler solution. What made you pick this vs say any other time series database (arctic looks pretty impressive, I'd say)?

temnik said:
What I don't like about blosc/bcolz off the bat is that it ties you into the python ecosystem. Yes, python is #1 in machine-learning right now, but I like to keep my options open.

Yeah, that's for sure - upon some thought, I am leaning towards HDF5 instead of bcolz. It's a common scientific format, supported across pretty much every language and has a pretty broad base in and outside of finance.

2rosy · Dec 26, 2017

sle said:
Yeah, that's for sure - upon some thought, I am leaning towards HDF5 instead of bcolz. It's a common scientific format, supported across pretty much every language and has a pretty broad base in and outside of finance.

people tend to go from hdf5 to bcolz (if not kdb). is your data ticks/events or is it already normalized somehow? if it's events, you might want to look at queues (i.e. kafka) for base storage, then something else when analyzing.

wmli · Dec 26, 2017

sle said:
I am leaning towards HDF5 instead of bcolz. It's a common scientific format, supported across pretty much every language and has a pretty broad base in and outside of finance.

I have had great success with PyTables, a Python library for HDF5. I store options tick data and load it into pandas/numpy.
http://www.pytables.org/usersguide/tutorials.html

sle · Dec 26, 2017

2rosy said:
people tend to go from hdf5 to bcolz (if not kdb). is your data ticks/events or is it already normalized somehow?

I store both T/Q ticks (since I don’t play UHF games, no book updates thank God) and resample to 1 second and 1 min. Why would bcolz be better? Is it faster?
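For reference, the 1 second / 1 min resampling step can be sketched in plain Python as below. It is an illustration under assumptions, not the poster's actual code: trades are taken to be time-ordered (timestamp in seconds, price, size) tuples, the output is OHLCV bars, and the names Bar and resample are made up here. Quote ticks would need a different aggregation (e.g. last bid/ask per interval).

```python
from typing import Iterable, List, NamedTuple, Tuple


class Bar(NamedTuple):
    start: int    # bar start, seconds since epoch
    open: float
    high: float
    low: float
    close: float
    volume: int


def resample(trades: Iterable[Tuple[float, float, int]], width_s: int) -> List[Bar]:
    """Aggregate time-ordered (timestamp_s, price, size) trades into OHLCV bars.

    Intervals with no trades produce no bar.
    """
    bars: List[Bar] = []
    for ts, price, size in trades:
        start = int(ts // width_s) * width_s  # floor to the bar boundary
        if bars and bars[-1].start == start:
            b = bars[-1]
            bars[-1] = Bar(start, b.open, max(b.high, price),
                           min(b.low, price), price, b.volume + size)
        else:
            bars.append(Bar(start, price, price, price, price, size))
    return bars
```

Calling resample(trades, 1) gives the 1 second series and resample(trades, 60) the 1 minute series. With pandas the same thing is usually done through its built-in resampling on a datetime index, which is likely faster on large blocks.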