Database organization

eusdaiki · Oct 22, 2014

that 32 bit is going to become a wall on your project rather quickly as you start loading the actual ticks into the db.
If you can go with an open source alternative on 64 bits from the start then it'll save you a lot of headaches later.
there are several column oriented databases (like kdb) in the open source world.
For example HBase & Cassandra in the Hadoop ecosystem, or MonetDB if you want something closer to SQL without hadoop.

cjbuckley4 · Oct 24, 2014

Thanks for the input here guys. This project is definitely coming along and I appreciate your advice. Being the stubborn kid that I am I have "decided to see for myself" (part of this decision was based on some input from folks on the kdb+ user's google group who I asked about 32-bit's feasibility). I currently have plenty of of HDD storage capacity to fill with data for testing purposes short term, so once I'm done building the C++ logic to interact with the databases I'm going to test both MonetDB (because I don't know hadoop but have SQL experience) and q/kdb+ with the same data and same partitioning and see what does better as things get bigger.

Again, I respect and appreciate the opinions of everyone who has given me input here, but I'm reasonably confident if I partition my tables in kdb+ I will be able to get around a lot of the limitations. That's why we test things though; I'll see which one works better for this application and report back on it so that other folks can see this in the future (NOTE I: it will probably be quite a while though because of midterms/school and my lack of experience with this sort of project). (NOTE II: As things progress here I may also decide to test some implementations with hadoop, but that isn't in the works yet).

Butterfly · Nov 11, 2014

not sure what 32bit and 64bit has anything to do with it,

you can store gigabytes of data even in 32bit system,

from the OP, this seems to be a very simple issue, not sure why it's being made more complex than it is

Butterfly · Nov 11, 2014

cjbuckley4 said:
3. The size of my database. This thing is obviously going to become HUGE. I figure I can't really know how huge until I try or someone tells me. I've been reading about different compression options from my database of choice and believe I'm using the right tools here, but maybe I will still need to trim back the amount of data I'm looking at. To be clear, I'm looking to database TRADES (not quotes) on S&P 500 equities and options, a few futures, and a few ETF/P/Ns. I really have no idea how big this is gonna get, so I'm just gonna try it and see. I'm prepared to invest in 4 or 5 TB hard drives every once in a while, but I don't have the means or the expertise to run a legitimate data server (I don't think I do at least, don't even know what the cost would be...if anyone has any estimates of what it will cost to scale this approach I'm all ears).
4. Backups. Simply put, I want to make sure all this data I'm working on capturing is safe. I'm prepared to pay a good amount for backup space, but again I worry it could become unmanageable as I scale.

you sound very new and inexperienced in that domain, which could explain why you ask so many tangent questions that go in every directions

from the sound of it, it doesn't sound like much. A database can store a lot of data in very little space, as long as your data model is properly structured and optimized. It seems that you have never designed a data model, so maybe you should start there before proceeding to select a database server or database technology. Unless you have a very specific need, most database software out there will do. One remaining question is their API, and what kind of language you are going to interact with them.

mySQL/PHP is a good combination for a beginner, Python/mySQL/MariaDB/MonetDB is also an option, and even PERL/MariaDB is a good combination for a beginner.

I wouldn't bother with C++/Java at this point.

I dare say you could do the whole thing with VB/Access if you are really new to all this.

cjbuckley4 · Nov 11, 2014

I am inexperienced with databases; this will be my first large database build. I built a tick data database in MySQL last year using tick data from Integral's TrueFX. It grew to about 250 GB uncompressed for the 5 majors over the span of about 3 years of data, so I was operating under the assumption that this database would grow much bigger. I've spoken to some people who have build similar systems professionally, and they've pretty unanimously told me that this is a large scale endeavor. Some of the companies I've spoken to have petabytes of tick data, so I was led to believe anecdotally that this would be a large database. Although I didn't partition my tables or anything, my MySQL implementation was unusably slow. Again, I appreciate the comment--albeit slightly condescending--but no one here has really answered the original question regarding how to best design the database. I've since come up with a schema that I believe works, but I'm sure you'll agree that it's rather quixotic to come answer my question by saying "it's basic" and then not answering the question and contributing to the narrative of what tech stack I should use. I know my tech stack already, I use what I believe is going to make me marketable for jobs in this field, I've signed on to do some experiments above, but otherwise, I'm really not interested in discussing what the optimal language will be here. I don't really care if this database takes me longer to build, I care about not looking like a total idiot in prop trading job interviews...which, believe me, is a way bigger task than building a database. I don't mean sound like I'm personally attacking you, but I'm getting kinda tired of the pervasive motifs of "who's smarter", "who's the better trader", and of course "language X is better than language Y, and no one on this site has any business using language Z." Again, not a personal insult, you've actually offered more constructive feedback than most of the answers I've gotten, but let's be honest...if I messaged you on quora/quantnet/LinkedIn/nuclearphynance and asked you the same question and offered to buy you a coffee to discuss you'd treat me differently. I'm not sure where that gets lost here, but maybe it's me being quixotic this time. Anyway, thanks for your feedback.

To anyone still following this thread regarding my tests, I'm still working on this project, and so far kdb+ 32 has been substantially faster, although I've partitioned my tables so that 4gbs hasn't really been a major limitation yet. I'll post more empirical findings later.

Butterfly · Nov 11, 2014

not sure what getting a job at prop trading firms got to do with your ability to build a database system

prop trading firms want workable strategies, not database tech monkeys

250GB uncompressed database ? not trying to be condescending again, but you do sound very amateurish, confusing datastore with database schema

again start by publishing your data model here, or the one you you did with mySQL for storing your tick data. Anyone can fill a database with garbage and make it huge, that's usually the trademark of mediocrity, thinking that bigger is better. The real talent is to make it as small as possible, not over sized, which is the typical amateur mistake. Think ACID, which is the corner stone of database design. If you can factor poorly built database from 250GB to 100GB or even 10GB, then it will be a real demonstration of your technical skills, something you could actually promote in any job interviews.

cjbuckley4 · Nov 11, 2014

I've actually since deleted that entire database, but my schema was about as simple as it gets. I had a table for each major pair, a column for timestamps, bids, and asks (all the data they provided). I used a UNIQUE key too because their data had duplicate entries. Again, I made no effort to compress it, but of course, ACID is what I should strive for. I knew even less about databases then than I do now. That's all I can really remember. All the prop trading firms who talk to me seem interested in is programming, and that seems logical to me. The HFT business is built on technology.

The model for my current db:
I programmed up the roll algorithm for creation of new futures contracts (only covering CME right now) so that I can add the expiration. Each day, I create a new table for each future I watch in a location on a filesystem based on year/month/day. The file location of day of data per contract + expiration + name of the contract is kept on another table so that I can query ranges from there. The actual kdb+ tables have this model:

TABLENAMEDATEDTERMINEDBYROLLALGORITHM

[ timestamp:'time$(); bid:$'float$(); ask:'float$() ] lastPrice:'float$(); lastSize:'int$(); totalVolume:'int$(); tickID:'int$(); basisForLast:'symbol$(); tradeMarketCenter:'int$(); tradeConditiond:'int$() )

I haven't figured out how to get options data to work yet because I don't know all the strikes for each underlier. I'm thinking of only doing options within like 3-10 trances of strike prices, but I'm not there yet. I've decided to only do CME futures and future options for the moment, but will add CBOE and then some equities as I need them. I'm also hopefully going to start recording realtime data soon so I can get quotes too. There's always better data though, I hope one day to be colocated and to database data coming off of exchanges in real time. I appreciate you looking at this.

volpunter · Nov 12, 2014

So let me understand this correctly: You pose as the big swinging dick here and cut this guy down left and right (despite him having been 100% honest and forthright in that this is something new to him). Then you go ahead and recommend an SQL based database to store columnar data structures? You still are an idiot!!!

And the programming language is one of the lowest priority concerns when thinking about storing data. Most popular database systems target multiple languages, so I have no idea what the issue regarding language choice is and why Java and C++ should be out of the question regarding API.

And then at the end you tell all of us that this amount of data can easily be handled in VB or Access? Can't this website really not implement some IQ test before it lets people join?

Butterfly said:
you sound very new and inexperienced in that domain, which could explain why you ask so many tangent questions that go in every directions

from the sound of it, it doesn't sound like much. A database can store a lot of data in very little space, as long as your data model is properly structured and optimized. It seems that you have never designed a data model, so maybe you should start there before proceeding to select a database server or database technology. Unless you have a very specific need, most database software out there will do. One remaining question is their API, and what kind of language you are going to interact with them.

mySQL/PHP is a good combination for a beginner, Python/mySQL/MariaDB/MonetDB is also an option, and even PERL/MariaDB is a good combination for a beginner.

I wouldn't bother with C++/Java at this point.

I dare say you could do the whole thing with VB/Access if you are really new to all this.

volpunter · Nov 12, 2014

Some of the guys posed already pretty decent recommendations to combat your problem. I would also recommend to go with a NoSQL solution that specifically optimizes columnar data series and its queries. The language of the API should really be driven by your expertise and comfort level because most data base solutions are targeting a myriad of languages.

cjbuckley4 said:
Thanks for the input here guys. This project is definitely coming along and I appreciate your advice. Being the stubborn kid that I am I have "decided to see for myself" (part of this decision was based on some input from folks on the kdb+ user's google group who I asked about 32-bit's feasibility). I currently have plenty of of HDD storage capacity to fill with data for testing purposes short term, so once I'm done building the C++ logic to interact with the databases I'm going to test both MonetDB (because I don't know hadoop but have SQL experience) and q/kdb+ with the same data and same partitioning and see what does better as things get bigger.

Again, I respect and appreciate the opinions of everyone who has given me input here, but I'm reasonably confident if I partition my tables in kdb+ I will be able to get around a lot of the limitations. That's why we test things though; I'll see which one works better for this application and report back on it so that other folks can see this in the future (NOTE I: it will probably be quite a while though because of midterms/school and my lack of experience with this sort of project). (NOTE II: As things progress here I may also decide to test some implementations with hadoop, but that isn't in the works yet).

cjbuckley4 · Nov 12, 2014

Thanks for your input @volpunter . Like you've said, I agree column oriented is the way to go here, and I've had somewhat of a bad experience with SQL for a similar project I alluded to above. I think you're also correct in saying that most database languages are targeted at a wide enough variety of languages. I was initially a little concerned with monetDB's MATLAB connection options, but I've since figured that out.

Judging by your name, maybe you could help me answer one of my earlier questions: how to best store options tick data? I'm thinking that for each day I may store something like n<=10 strike prices away from yesterday's close and then constrain the max and maybe min time to expiration in some way.

That may be a bit unclear, but the issue im running into is that I don't know how to programmatically determine a 'roll algorithm' for options like I have for futures, so I figured if I just store all the options within a certain number of tranches of strike prices yesterday's close and in a certain interval of expirations then I could effectively programmatically capture all the options data necessary for backtesting. Any thoughts on how to best solve this problem? Any advice on databasing options data in general? I found futures pretty easy to solve once I looked on the exchanges' websites, but databasing options is still a question for me.

The other, probably bad, idea I had for tackling all these problems was this: store ticks on the symbols I watch as they come in to one single table (partitioned probably daily depending on size) and use a reference table to direct my queries to the proper table. This would effectively eliminate the need to 'roll' any of securities and it would basically allow me to play the tape in fast forward for Backtesting. A lot of the kdb+ example code uses this methodology, so I assume it might be viable. Maybe this is another thing I can test.