Database organization

EXavier · Dec 7, 2014

Anybody here using Ruby for database design? Any good books or references on the topic?

Butterfly · Dec 7, 2014

Ruby for Database Design ? LOL !!!!

cjbuckley4 · Dec 7, 2014

I can't really answer your question about database design, because im a DB user, not creator...that's above my pay grade (or at least a year above me in school).

In regards to using ruby in general for trading, I have sort of contradictory thoughts.
1. I'm building trading tools from essentially the ground up. It's hard. It takes a long time, and a lot more expertise than I currently have...I know some of my stuff is most assuredly garbage, and as such it's a highly iterative and time consuming process. I do it because I want to understand what's going on 100% and because I find it interesting. If you just want to hit the ground running in automated trading DO NOT do what I'm doing. Use ninja, trade station, Metatrader, Amibroker, whatever. There's no reason to do what I'm doing for 99% of people on this site.
2. On the flip side, go ahead and use ruby if you want. I don't know a whole lot about ruby, but people use all kinds of languages for trading tasks. If you're gonna be a hard head like me and insist on building your own stuff, then use whatever language makes you happy. Yeah, there are limitations, and every language comes with pros and cons, one of which being I can't think of a single broker with a ruby API (OANDA for FX has a RESTful API I've played with), but screw it man. If you think ruby fits your needs then go for it, I'm not going to tell you it can't be done.

AdrianHagh81 · Dec 7, 2014

for me rather then reinventing the wheel i just use ninja and multicharts, so i focus my energy on building and modelling custom machine learning algo.

i sfore tick data in myisam tables, which is really just a flat file, works pretty well.
i have a spare oracle license from work, after trying it out for tick storage, i decided it was a cumbersome overkill

eusdaiki · Dec 7, 2014

EXavier said:
Anybody here using Ruby for database design? Any good books or references on the topic?

Why do you think Ruby is a good fit for the task?

blah12345678 · Dec 7, 2014

For what it's worth, I have both a US stocks database and an EOD Option database.

I have an i5-2500 with 32 GB RAM and a 2 TB data storage.

I run FreeBSD and use Postgres for the database.

I get the data from Yahoo Finance.

There are 3 issues with this thread - (1) designing the database schema, (2) getting the data into the database, and (3) getting the data you want out of the database.

Item #1 is the most important because it directly affects #2 and #3. Number 3 is the 2nd most important item. What I've learned is that if you want optimal query performance, you need to identify and create indices on your frequently accessed data. Due to my liberal creation of indices (single and multi-column), I can get the entire price history for a stock symbol in milliseconds using Perl's DBI.
I perform all the calculations I desire in my queries ahead of time, and store them in the database. This way, the data I want is instantly available. I don't have to repeat the same calculations over and over for each study script I create.

I set up my databases in a relational manner because I don't want to have 1 gigantic table with 50-100 columns.

I also use Postgres because it works, I like it, I understand it, I know how to optimize the 3 items I listed above, and I know how to use Perl to get the data in and out in (mostly) optimal fashion.

My schema is as follows:

stocks_sectors: pk, sector, yahoo_url
stocks_industries: pk, industry, sector (fkey), yahoo_url
stocks_etfs: pk, symbol, company, type (stock or etf), exchange, yahoo_url, country, sector(fkey), industry(fkey), active, trades_options, etf_fund_family, etf_category
stocks_prices: pk, symbol (fkey), date, open, high, low, close, volume, factor_higher (for splits - multiply), factor_lower(for splits - divide)
stocks_facts: pk, symbol(fkey), shares_outstanding, yahoo_url, first_day_traded, days_traded, all_time_high, all_time_high_date, all_time_low, all_time_low_date
stocks_splitsdividends: pk, symbol(fkey) date, type (split or dividend), split, dividend, source (yahoo or stocksplithistory.com)
stocks_earningsdates: pk, symbol(fkey) date, source
stocks_volumes: pk, symbol(fkey), date, one_month, two_months, three_months, six_months, nine_months, one_year, two_year - this table stores the avg volume for the period for each date.
stocks_stddev: pk, symbol(fkey), date, std5, std10, std15, std20, std30, std40, std50, std60, std80, std90, std100, std120, std126, std150, std180 std200, std252, std365, std504, std730, std755 -- this table stores the std deviation of closing prices over 5 days, 10 days, etc.
stocks_hv: pk, symbol(fkey), date, hv5-hv755 -- this table stores the historical volatility of closing prices over 5 days, 10 days, etc.
stocks_pricespikes: pk, symbol(fkey), date, price_spike (based upon std20 in prices_stddev table)

Regarding the options database, due to the volume of data, I've simplified and created tables by expiration year. I haven't really split it up like it should.

But for the record, here are the row counts for the options database (from historicaloptionsdata.com):

2002 - 15.9 million
2003 - 25.9 million
2004 - 31 million
2005 - 36.9 million
2006 - 43.4 million
2007 - 49.3 million
2008 - 65.7 million
2009 - 77.3 million
2010 - 86.3 million
2011 - 101 million
2012 - 118.7 million
2013 - 130.8 million
2014 (until 8/31) - 137.4 million

Once again, liberal use of single and multi-column indices is the only way to pull the data out in milliseconds instead of minutes....

Hopefully this is a useful idea from someone who's doing instead of talking...

The key is finding a solution you like, you easily understand, you can extend with a little bit of coding, and most importantly, gives you the answers you seek as quickly and conveniently as possible with minimal maintenance and other issues.

cjbuckley4 · Dec 8, 2014

Thanks for the input. I like a lot of stuff about your approach. I also appreciate you giving me some information about how this thing may scale, which is definitely a concern for me. I have a similar set up, I know your Xeon is much better optimized for database tasks than my 3770, but I cant unbuy my Ivy bridge at this point. I do however have two 1tb SSDs dedicated to storaging this DB, so maybe that will help my performance from a hardware standpoint. I also have 32gbs of ram, which I hope will help me load up large datasets and give me the freedom to move them into different data structures without worrying. How much does having a lot of memory impact DB read performance in your experience?

Also, I would lloooovvveee to move this project back to *nix at some point! I admire you for that. My Backtesting program built in conjunction with this is mostly in C#, so maybe with .NET 2015 I can make that happen. From the humble student: what sort of advantages does using a BSD/*nix environment confer for this task? What OS tunes have you made (besides the plainly obvious one...setting swappiness to zero haha)?

I'm liking the idea of using a relational dbms more and more, just because I'm taking an online course that teaches about the relational model, so maybe I should start with a firm understanding there and see how it goes before I move to the column oriented ones. Thanks for the tip about indexing, the more I've learned, the more it seems important. Have you introduced any partitioning or played with that? Good advice on doing some calculations in advance and leaving them on my tables. Although I didn't have anywhere near as many as you, if you look on some other threads, you'll see im already doing that. Definitely a solid idea.

So I'm still a bit confused by one aspect of your (very informative) post. Is the data you're storing on equities tick data? I see you have EOD options data...that definitely makes sense, as you pointed out, OPRA is HUGE (those numbers even seem lower than ones I've seen in the past for some reason?)

My goal here (now) is to store nanex tape files from some futures exchanges. So far no storage issues, but I may want to expand to equities and options too later. Do you have any idea on how scalable that will be? These EVO 1TB SSD drives aren't cheap, so I'm hoping I can stick with what I have for a while!!

Thanks for your really solid advice. Good to meet someone on here who's built something similar to what I'm looking to accomplish.

blah12345678 · Dec 9, 2014

I don't have a Xeon. Just checked and noticed. I have a dual-core i5-3570... Better than the i5-25xx I thought I had, but it's still way down the list on the PassMark rankings ....(http://www.cpubenchmark.net/high_end_cpus.html)

FreeBSD 10.0-RELEASE-p3 #0 r266262: Fri May 16 14:18:25 EDT 2014
CPU: Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz (3401.21-MHz K8-class CPU)
real memory = 34359738368 (32768 MB)
avail memory = 33284837376 (31742 MB)

I use 4-500GB 7200rpm SATA3 drives in RAID 0 config. Nothing fancy or particularly fast.

I have 15 years experience using FreeBSD... Some of the things I do will not necessarily make sense to you unless you've been using it a couple of years... Just have to remember the main thing - FreeBSD is an operating system, Linux is a kernel. If you don't need bleeding edge, and prefer to do real work instead of working to keep your system stable and usable, choose FreeBSD over Linux. Just be sure to first check and make sure *all* of your hardware is compatible before installing. Really. I mean it. All of your hardware.

The Handbook on FreeBSD.org is all you really need to get your system up and running, so be sure your laptop/tablet/phone is close by....

FYI - Rackspace offers a FreeBSD 9 instance if you want to play around and test things before converting your hardware.... You can spin it down when you're not using it so you don't get charged for the time. Also, it's more convenient to blow-up a virtual instance than a real one. If needed, all you need to do is just delete the bad one and start over.

Here's my /etc/sysctl.conf:

kern.module_path=/boot/kernel;/boot/modules;/usr/local/modules

net.inet.tcp.blackhole=2
net.inet.udp.blackhole=1
net.inet.tcp.log_in_vain=1
net.inet.udp.log_in_vain=1
net.inet.ip.random_id=1

security.bsd.see_other_uids=0
security.bsd.see_other_gids=0
security.bsd.conservative_signals=1
security.bsd.unprivileged_read_msgbuf=0
security.bsd.unprivileged_proc_debug=1
security.bsd.unprivileged_get_quota=0
security.bsd.hardlink_check_gid=1
security.bsd.hardlink_check_uid=1
security.bsd.suser_enabled=1

kern.coredump=0
kern.maxfiles=131072

kern.ipc.shm_allow_removed=1

# shared memory tuning for Postgres
#
kern.ipc.shm_use_phys=1
kern.ipc.shmmax=34359738368
kern.ipc.shmmin=1
kern.ipc.shmall=8388608

Here's the relevant portion of my postgresql.conf as suggested by pgtune (included with the Postgres install):

default_statistics_target = 50
max_connections = 80
work_mem = 192MB
maintenance_work_mem = 1GB
effective_cache_size = 22GB
wal_buffers = 8MB
shared_buffers = 7680MB
constraint_exclusion = on
checkpoint_segments = 256
checkpoint_completion_target = 0.9

blah12345678 · Dec 9, 2014

All my data is EOD....

cjbuckley4 · Dec 9, 2014

I see. I imagine EOD data is gonna be substantially smaller than tick data, but a lot of your practices still apply. Thanks for the FreeBSD info as well. It's a cool OS for sure, just right now as I'm getting things set up I've found C# with LINQ to really simplify a lot of issues for me. So right now im really just prototyping all this stuff in SQL Server, but I really hope to move back to linux at some point. As much as I've come to love Ubuntu, I must say that the Windows provides a far more streamlined platform for initially getting things set up.