Storing time series data

ajensen · Feb 21, 2019

Storing time series data
by Saeed Amen
Cuemacro
February 2, 2019
Subjects: Data, General, Python
...

When it comes to analysing financial markets, the ingredients are key,
essentially the data. However, with data, comes the need to storage.
We need to have a good “fridge” to use our burger metaphor. Typically
most financial data is in some sort of time series type format. You
have some timestamp, and then some observation, like a price, economic
data, news article etc. So the question basically amounts to what is
the best way to store time series? In this article I’ll go through
some of the solutions for storing time series data. It is of course
not an exhaustive list but hopefully it’s still a start.

...

****************************************************

What I do is use CSV or binary files named by the stock symbol. How do you store time series data?

fan27 · Feb 21, 2019

I use CSV. It is simple enough and I am not doing a bunch of reads and writes to disk.

Snuskpelle · Feb 21, 2019

The key question is whether you can fit all data into main memory and have some way to handle the cases where it does not. CSV in file by symbol name will get you pretty far though and it's what I've used for my amateur stuff. Would I be doing HFT analysis I would be thinking considerably more about storage...

tommcginnis · Feb 21, 2019

A lot of generalities flying by here -- in terms of data size, frequency of mass input, definition of "storage" (in size and in duration).... but:

For a lower-level, general purpose ap., I store, and then tap for one input, a single .csv file.
The output grows to about 10x the input size, but that is perhaps half graphics.
Size during computation is not an issue, but if it became one, I would focus on the size, not the readability. (As long as my input files are un-corrupted, I can regenerate anything I wish later...)

Readability by other programs is a bit of an issue, which .csv spanks.

I could double or triple operations right now, and not see a storage-related bottleneck, I guess. So the focus is on flexibility and durability. == .csv

andysinclair · Feb 21, 2019

I am storing time series data in a SQL Server database (I have a lot of experience with SQL Server), it's working really well.

I've built a library to make the data available in Python and that works well. I would recommend a relational database for time series data assuming you are not working with very low frequencies, i.e. minute or less.

nooby_mcnoob · Feb 21, 2019

I use SQLite and use SQLAlchemy to avoid having to code the CRUD operations.

Here is the worst query that converts hourly bars:

Code:

with daily_bars(time_series_id,time_series_name,open_datetime,close_datetime,low,high,volume,openinterest) as (
select
    time_series_id,
    time_series.name,
    min(baropen_datetime) open_datetime,
    max(baropen_datetime) close_datetime,
    min(low) as low,
    max(high) as high,
    sum(volume) as volume,
    sum(openinterest) as openinterest
from
    bar_data
inner join
    time_series
on
    time_series_id = time_series.id
where
    time_series.bar_size = 'HOURLY'
    and time_series_id = :time_series_id
    and date(baropen_datetime,:offsetHour,:offsetMinute) >= :start
    and date(baropen_datetime,:offsetHour,:offsetMinute) <= :end
group by
    time_series_id, date(baropen_datetime,:offsetHour,:offsetMinute)
order by
    min(baropen_datetime) desc
limit
    :window
)
select
    db.time_series_id,
    db.open_datetime as baropen_datetime,
    bdo.open as open,
    db.high as high,
    db.low as low,
    bdc.close as close,
    db.volume as volume,
    db.openinterest as openinterest
from
    daily_bars db
inner join
    bar_data bdo
on
    bdo.baropen_datetime = db.open_datetime and
    bdo.time_series_id = db.time_series_id
inner join
    bar_data bdc
on
    bdc.baropen_datetime = db.close_datetime and
    bdc.time_series_id = db.time_series_id
order by
    baropen_datetime asc

Felix168 · Feb 21, 2019

Assuming that you are interested in technical analysis, storing the data is unfortunately only half of the problem.

The stock data most likely also need to be adjusted, so that you can properly run indicators on them. Adjusting data is an operation, which goes _back_ in time, so you need to purge all the quotes you have stored, and replace them by freshly adjusted quotes.

Now you might think that this happens only on rare occasions. If you have stock splits in mind, that's probably true. However, data should also be adjusted for cash dividends (which most data feeds don't do).

If you are just a single guy, with no additional infrastructure to support you, maintaining a proper quote database will become a pretty substantial job. That's when it might make sense to chose a data provider, which does this work for you.

I personally use Norgate Data. Their tool, the Norgate Data Updater, maintains a database in some proprietary format. It will take care of the updating, by downloading new data on a pre-defined schedule, and (as it seems) they keep the split and dividend information separate from the quotes. Software can now connect to this database with their API, and request data with varying degrees of adjustments done: unadjusted, adjusted for capital restructuring, adjusted for capital restructuring and cash dividends.

FWIW, my TuringTrader project supports Norgate Data: https://www.turingtrader.org/2019/01/now-supporting-norgate-data/

Cheers, Felix

Craig66 · Feb 21, 2019

For what it's worth, I use binary files of tick data indexed by symbol and date.
When you need to test across years of tick data with hundreds of symbols, CSV just isn't going to cut it.

Howard · Feb 21, 2019

*.csv. But I only use 1-minute data.

tommcginnis · Feb 21, 2019

nooby_mcnoob said:
I use SQLite and use SQLAlchemy to avoid having to code the CRUD operations.

Here is the worst query that converts hourly bars:

Why for, "worst"?? That looked like a template that would have my Functional Programming friends all tingly with joy!