best way to hack up an rdbms to approximate something like kdb for poor/dumb people

Quote from DarthSidious:

I am still trying to understand how using infobright or any other column based database is helping you. A good paper on this topic can be found here. Quoting from that article: "If you're bringing back all the columns, a column-store database isn't going to perform any better than a row-store DBMS, but analytic applications are typically looking at all rows and only a few columns" says Gartner analyst Donald Feinberg. "When you put that type of application on a column-store DBMS, it outperforms anything that doesn't take a column-store approach."

Now, your typical query would bring back all data from all row all the time, wouldn't it? Except, well, if you are just comparing close prices of 2 or more instruments, or querying close prices only. In those cases, you can leave out the OHL data. But say for building bars etc., theoretically, there is no benefit.

Maybe some benefit based on the fact that some of the columns are compressed? (again, see the article).

I would be very interested in some hard numbers for comparison purposes. For example,
a) total # of rows, total # of rows for MSFT, total # of rows returned for "select * where ticker="MSFT" and from=thisdate and to=thatdate (say for 2 years)
b) same as above, but comparison between 2 stocks

Time taken for the above queries, and of course your hardware.

I think the benefit actually comes from compression, because the storage on disk after compression is substantially less than without compression. The results are returned to me quite quickly, in under a second. 16GB vs much less.

I'm not sure this is a long-term solution, but for my purposes, it's ok for now. I avoid sophisticated queries and use 'merge' in R, post-query, where date+time are in a range. I only use the database to fish queries between a certain date, like in your example. So I'm not really even using any sort of query optimizer or anything.
 
ok thanks. Will take me some time to load all the data and do some cleanup. Will keep it standard mysql for now, and after everything is up and running would do some testing with infobright & infinidb.
 
Quote from garchbrooks:

I think the benefit actually comes from compression, because the storage on disk after compression is substantially less than without compression. The results are returned to me quite quickly, in under a second. 16GB vs much less.

I'm not sure this is a long-term solution, but for my purposes, it's ok for now. I avoid sophisticated queries and use 'merge' in R, post-query, where date+time are in a range. I only use the database to fish queries between a certain date, like in your example. So I'm not really even using any sort of query optimizer or anything.


It boils down to the excercise of what a database is for - to find things. If you haven't come up with a plan on exactly what you are trying to find, all your energy is for naught getting caught up in technological issues, rather than the issue at hand.

Its like people looking for the holy grail when it comes to building indicators. They just tinker around trying to stumble into something. Don't get caught up in that trap. Its a trapdoor many go through and get stuck in a world of "hope".

You have time series data in your hands. The unique column is timedate stamp. Then OHLCV. Given that data, you can do SQL manipulations on it trying to find someting of interesting. You can create additional columns containing information related to the OHLCV DateTimestamp and search on it. You index on the stuff you are searching on to make it faster. Thats all this shit is about.

Do you know exactly what you are looking for?
 
Quote from FutsTrader111:

It boils down to the excercise of what a database is for - to find things. If you haven't come up with a plan on exactly what you are trying to find, all your energy is for naught getting caught up in technological issues, rather than the issue at hand.

Its like people looking for the holy grail when it comes to building indicators. They just tinker around trying to stumble into something. Don't get caught up in that trap. Its a trapdoor many go through and get stuck in a world of "hope".

You have time series data in your hands. The unique column is timedate stamp. Then OHLCV. Given that data, you can do SQL manipulations on it trying to find someting of interesting. You can create additional columns containing information related to the OHLCV DateTimestamp and search on it. You index on the stuff you are searching on to make it faster. Thats all this shit is about.

Do you know exactly what you are looking for?

Yes, I know what I am looking for. I designed a model based on an observation I saw in the market, and the performance of that model was so impressive I wanted to look for other candidates.

I've actually got 12 or 13 cores mining the database now. It took a few iterations to get to the point where I could use the data, and then I filtered it a bit. I managed to get several new candidates for trading. Of course, I'm not ashamed to admit my implementation is a piece of shit and no real institution would ever do things so shoddily, but it is what it is and I'm fine with my limitations for now.

So, yes, I am happy. infobright was the right tool for the job in the short run, but in the long run, I'm going to hit this problem again. Not much I can do, other than take what I have now, create some revenue, and use that revenue to finance more research and better operations.

The next time I look into this problem, I will probably try infinidb because their community edition is not crippled.
 
I was playing with the new InifiniDB Windows release this weekend. It doesn't use indexes, so to find a range of data it does a full column scan on one or more "extents" of 8 million rows. InfiniDB does it all in parallel, so it was cool to see all 4 cores running full blast. But even with 4 cores it was still a bit slower than MySQL MyISAM with a ridiculuosly huge & gratuitous index (with more cores it may be great though).

My conclusion was that if you need to query/join huge data in all kinds of different ways (including some that MySQL would completely choke on), it is great. But if you want to optimize one type of range query, it's hard to beat MySQL doing a straight shot from the index.
 
Also played with LucidDB this weekend. For my application, it is slower. InfiniDB was a tiny bit slower, LucidDB 2 times slower.
 
Quote from rosy2:

ok. i didnt notice that. but using a relational database for timeseries data has been discussed in other threads. its more trouble than its worth.

look into python, pytables, scikits.timeseries, scipy, and hdf5. with that you can easily window over the day of ticks

Probably the best advice for me .... HDF5 or similar.
 
I second the suggestions of using Cloud Computing. I use EC2 and it's great to be able to fire up an instance with my data loaded and 15 minutes later I can run a bunch of stuff, get results, and shut it down until I'm ready to test again.

You can start with a "Large" instance which is the cheapest 64-bit instance type at $.48/hour. When you need to do a lot, you can fire up the same instance on a Quadruple Large instance for $2.88/hr with these specs:
68.4 GB of memory
26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: m2.4xlarge

You can develop your scripts/analysis algos on your cheaper machine at home. Then when you need to crunch numbers you can use a cloud instance.

The cost to hold 100GB in an EBS volume is $10/mo ($.10/GB/mo)

If you're going to run stuff 24/7, probably better to buy a machine. But you can get a lot of hours on this hardware for a few hundred bucks.

If you use EC2 (I haven't looked at all the alternatives) I'd recommend using the Elasticfox plugin for Firefox to make things easier.
 
Back
Top