Database organization

blah12345678 · Dec 9, 2014

Yeah, that's the big complaint with the free *NIXes - you have to build your environment yourself out of parts. Which is also it's biggest positive if you know what you want and know what you're doing..

Butterfly · Dec 17, 2014

gee, you guys really have no clue in database design, it's hilarious, like watching a monkey trying to ride a bicycle

John Tseng · Dec 27, 2014

I've taken a different approach than blah. My database schema is extremely simple:
Instruments table and Ticks table. I compute bars on the fly. So far it hasn't given me issues, but I'll probably create an indexed view if that's a problem. (Here's where I love Postgres: Materialized views (in Postgres) are a lot more forgiving than Indexed views (in SQL Server) at the cost of manually triggering an update.)

Right now, I only have 1 instrument in there for 75 million rows and 3GB of space. I don't have other instruments cause I'm running out of space. I'll get a bigger drive and more instruments once I start making money on this one.

I've compared the speed of SQL vs CSV. They're nearly the same speed. The biggest difference is that for CSV, I have to wait forever to start testing anywhere except the middle. For SQL, it's instantaneous.

Also, storing the results of the trades in SQL is very important for me, since I query that thing in a bunch of ways. Maybe I should stick it in a cube some day?

My schema for results look like this:
A strings table and a Trades table. The trades table records prices, times, and identifiers so I can separate out which strategy did what, and why it decided to do so.

I found most of the data issues came from storing trades too slowly. So I bulk load the trades in batches, and use a separate thread to do that.

volpunter · Dec 27, 2014

I like your approach to building compressed time series (bar series) on the fly something I do as well because there is no need to waste storage space when you can build higher frequency compressed bars from lower frequency data points.

Regarding your approach to storing time series, , don't you feel it speaks for the low throughput and mediocre performance of SQL and Postgres applied to time series (bad idea always, but I repeat myself) that you have to compare performance to csv text based files? Even a simple binary data store would blow any SQL or Postgres project away easily. Imagine the following schema:

Header[x bytes, fixed size] + Datapacket1 + Datapacket2 ...., where Datapacket = [DateTime as millisecond long integer, datapointA (such as BidPrice), datapointB(AskPrice), .....]

Each Datapacket within a given times series has the same fixed size in bytes. You use a "StoragePlugin" that shares the same interface with other storage plugins and the interface enables you to serialize datapoints, deserialize them, and it stores knowledge of the size of each data packet. In that way you can essentially store mixed data in the very same database. You can store time series with datapoints of type [timestamp, bid, ask] such as fx tick based data, or you could store [timestamp, last trade, volume] for a futures quote, or [timestamp, open, high, low,close]. The Header will provide information about the data structure such as size of file, number of data points, size of each data packet, symbol information,...

With such structure you can now easily run a binary search algo over an arbitrarily large time series and it will take a fraction of a millisecond to retrieve the matching position in your binary time series. From that datapoint you can simply read x number datapoints, convert on the fly to compressed data or run any other type of query. Sure, you do not have the benefits a relational db offers but actually you still do because you can simply read the data points in question and run relational queries in memory. That is basically anyway how SQL Server does it unless you heavily index your data. But you can essentially do the same by caching your queries or optimize queries in memory (I can provide more info on that if requested but it will be more in-depth).

In any case I am certain I run circles around anyone with an SQL or Postgres solution, I also run circles around anyone with an HDF5 database/file structure, and I even come incredibly close to KDB queries on time series. I would even do better if I further optimized my binary search algorithm.

John Tseng said:
I've taken a different approach than blah. My database schema is extremely simple:
Instruments table and Ticks table. I compute bars on the fly. So far it hasn't given me issues, but I'll probably create an indexed view if that's a problem. (Here's where I love Postgres: Materialized views (in Postgres) are a lot more forgiving than Indexed views (in SQL Server) at the cost of manually triggering an update.)

Right now, I only have 1 instrument in there for 75 million rows and 3GB of space. I don't have other instruments cause I'm running out of space. I'll get a bigger drive and more instruments once I start making money on this one.

I've compared the speed of SQL vs CSV. They're nearly the same speed. The biggest difference is that for CSV, I have to wait forever to start testing anywhere except the middle. For SQL, it's instantaneous.

Also, storing the results of the trades in SQL is very important for me, since I query that thing in a bunch of ways. Maybe I should stick it in a cube some day?

My schema for results look like this:
A strings table and a Trades table. The trades table records prices, times, and identifiers so I can separate out which strategy did what, and why it decided to do so.

I found most of the data issues came from storing trades too slowly. So I bulk load the trades in batches, and use a separate thread to do that.

John Tseng · Dec 27, 2014

I like your scheme. I'm sure it's incredibly fast. Since you know your application very well, you don't need to provide a general data interface that SQL provides.

I think, though, it's not worth it at this stage for me to use such a scheme. Once I put in any non-trivial strategy into my backtest system, however, data acquisition drops down to 2% of total time.

So I think if we were looking at database organization purely from a performance standpoint, the binary store would be best. However, once we account for the limited time I have to spend on this project (about an hour a day), development time starts to dominate.

For what it's worth, I still haven't spent the time to import some of my CSV data. I just wait the 5 minutes for it to run through the old data. I spend that 5 minutes working on refining my strategy.

volpunter · Dec 27, 2014

If that is the case then it seems your strategy might be very complex and computationally intensive or you possibly have not taken advantage of possible optimizations such as parallelizations or such.

John Tseng said:
I like your scheme. I'm sure it's incredibly fast. Since you know your application very well, you don't need to provide a general data interface that SQL provides.

I think, though, it's not worth it at this stage for me to use such a scheme. Once I put in any non-trivial strategy into my backtest system, however, data acquisition drops down to 2% of total time.

So I think if we were looking at database organization purely from a performance standpoint, the binary store would be best. However, once we account for the limited time I have to spend on this project (about an hour a day), development time starts to dominate.

For what it's worth, I still haven't spent the time to import some of my CSV data. I just wait the 5 minutes for it to run through the old data. I spend that 5 minutes working on refining my strategy.

cjbuckley4 · Dec 27, 2014

Hey folks. Here's what I've been doing with this the last week or so. I've changed my approach a bit.

1. I (re)built all the C# components that read the CSV into any dbms. They now read the tick data into a dataset. (Note: Depending on how things progress, I may or may not include some analysis in SQL, which would include loading the CSV into a table and querying it.) I will then use LINQ and C# to query/search for specific statistics on my data. These operations/statistics will include:
- converting to 1 second, 1 minute, and 1 day bars.
- converting any string values to numbers (using reference table to avoid storing 5 varchar/strings per tick).
- sort tick data into separate files by these criterias: exchange, security, and date.
- I may just use my data provider's flags, but I may come up with my way to flag 'bad ticks.'
- find and flag long periods of inactivity/illiquidity.
When all this is done, certain statistics will be stored in SQL tables corresponding to the data by day, these can be quickly aggregated into yearly and monthly stats. The locations of tick data and reference values for strings will also be placed on reference tables in SQL.
Next up: kdb+
- create a .q file called "<exchange>/<security>/<year>/<month>/<day>/<frequency>.q and store it at the location specified in the SQL.
- bulk insert the processed corresponding C# tick data dataset here.
- save kdb+.
Finally, utilization:
- I'm going to create C# and MATLAB (my two primary analysis environments...might do R, C++, and Python as needed/my interested change later) libraries that make querying this data easy. It will include a C# component I've already written to insert tick data contiguously in a Nanex Timeseries object I built in C# from kdb+ tick data. So now I need to build it to fit discrete time, as well as a way to parse both databases into appropriate MATLAB strictures. I'm sure my analytics in SQL will grow in the future, so I'll continue to evolve that part of the program.

The reason I'm doing it this way:
- q is hard, and there aren't a lot of resources available, so I want to do the heavy lifting and memory intensive stuff in C# and SQL where learning/doing is easier and I don't have 32-bit limitations on kdb+.
- this way, I mitigate the structure confusion problems that started this thread, as well as dealing with strings in q.
- I use q/KDB+ 32-bit where it performs well: querying tick data files all substantially under 4GB in memory. This will hopefully give me satisfactory speed for running many backtested in parallel on the same data (optimization).
- I use SQL to handle the dynamic and rather unstructured task of handling my analytics on tick data. I've had so many ideas for analytics and using SQL to store them all easily and scalably makes sense to me. I'm not sure SQL will set any speeds records returning 1 second (precomputed) bars on 2+ years of data, but still I prefer the flexibility right now for my purposes.
- I use C# to glue to all together and make sense of the complex stuff like rolling futures contracts, pre checking data, converting to bars and building the final dataset delivered from the databases. LINQ is really helpful for querying too.

cjbuckley4 · Dec 27, 2014

@volpunter. I love the flat file approach you've used. As I get better I might try that.

Risk619 · Dec 29, 2014

cj - One thing to consider is that you might want to really leverage your database for doing any set based calculations possible. I've found substantial performance bumps by doing things like sma, ema, wma, slope, arctan, stdev, and other such "technical indicators" in set mode. I'm using SQL Server 2012 Enterprise which has a lot of bells and whistles, but I'd imagine that any modern database management system can handle some or all of those.

Being able to have all the formula products produced as part of your entity framework (or whatever ORM) is huge for cutting down busy code and looping in C# (or whatever 3gl).

jtrader33 · Feb 5, 2015

volpunter said:
Imagine the following schema:

Header[x bytes, fixed size] + Datapacket1 + Datapacket2 ...., where Datapacket = [DateTime as millisecond long integer, datapointA (such as BidPrice), datapointB(AskPrice), .....]

Each Datapacket within a given times series has the same fixed size in bytes. You use a "StoragePlugin" that shares the same interface with other storage plugins and the interface enables you to serialize datapoints, deserialize them, and it stores knowledge of the size of each data packet. In that way you can essentially store mixed data in the very same database. You can store time series with datapoints of type [timestamp, bid, ask] such as fx tick based data, or you could store [timestamp, last trade, volume] for a futures quote, or [timestamp, open, high, low,close]. The Header will provide information about the data structure such as size of file, number of data points, size of each data packet, symbol information,...

With such structure you can now easily run a binary search algo over an arbitrarily large time series and it will take a fraction of a millisecond to retrieve the matching position in your binary time series. From that datapoint you can simply read x number datapoints, convert on the fly to compressed data or run any other type of query. Sure, you do not have the benefits a relational db offers but actually you still do because you can simply read the data points in question and run relational queries in memory. That is basically anyway how SQL Server does it unless you heavily index your data. But you can essentially do the same by caching your queries or optimize queries in memory (I can provide more info on that if requested but it will be more in-depth).

cj- I hope you don't mind me coattailing a bit in your thread. I have a few questions about my set up and if volpunter's approach might be able to help. You may want to cover your eyes, I am less knowledgable than anyone here and am assuredly going to ask dumb questions.

I have a hodgepodge of data including equities tick data (~7TB), 1min option data (120GB), EOD option data (80GB), and various equities interval data. All of it is in csv or txt files. The data is spread across non-RAIDed 7200rpm mechanical drives of 4TB, 4TB, 1TB, 750GB.

In short, backtesting is painful because my seek times suck. Some questions related to that if anyone is feeling charitable:

1) Can I assume that running multithreaded backtests (testing a different underlying on each thread) is beneficial for seek times despite HDDs only being able to read one file at a time? I believe I read somewhere that the OS will optimize queued read requests to minimize total seek time.

2) I understand that seek time is nonexistent for SSDs. I can probably move everything except for the tick data onto one. In addition to seek time improvement, can multiple threads read separate files off an SSD in parallel?

3) I'm not all that familiar with binary files but I understand they can be read faster than parsing text. However, read times haven't really been much of an issue. Are there other advantages to the "binary file store"? I honestly don't even really know what that means...is it simply creating/saving binary files the same way I have my csv files?