Database organization

Butterfly · Feb 5, 2015

Is storing terabytes of useless tick data from the past really relevant for a winning strategy? Isn't it the equivalent of reading dry tea leaves to predict the weather or, better, the fate of humanity?

It seems a lot of good IT resources are being wasted on useless data for useless and clueless "traders" who think they can read tea leaves better than the next person.

I think the main concern shouldn't be whether SSD or HDD is faster for running your backtest, but whether you have a worthwhile, strong strategy that you can actually build or run in a live environment. Silly backtests are not going to deliver that, no matter what tech you use behind them.

volpunter · Feb 5, 2015

1) No, reading one file at a time is in fact faster than anything that uses queued or multithreaded logic on the data-reading side. Load the data into memory and run backtests from memory. That's the best way to escape low read throughput and seek times.

2) No, you can't read separate files at once. Even if it looks like you can when you attempt it, the jobs are still queued.

3) Now you say read times are not an issue? But questions 1 and 2 deal directly with read time issues. So what exactly is your bottleneck? Have you been able to pinpoint the problem? Don't change or optimize anything until you have identified the problem it is supposed to fix.

Read up on binary files. I do not want to say it's lame, but it's lazy to ask others to invest the time to explain the basics to you. That's what Google is for. Happy to help when you come back with specific questions.

jtrader33 said:
cj- I hope you don't mind me coattailing a bit in your thread. I have a few questions about my setup and whether volpunter's approach might be able to help. You may want to cover your eyes: I am less knowledgeable than anyone here and am assuredly going to ask dumb questions.

I have a hodgepodge of data including equities tick data (~7TB), 1min option data (120GB), EOD option data (80GB), and various equities interval data. All of it is in csv or txt files. The data is spread across non-RAIDed 7200rpm mechanical drives of 4TB, 4TB, 1TB, 750GB.

In short, backtesting is painful because my seek times suck. Some questions related to that if anyone is feeling charitable:

1) Can I assume that running multithreaded backtests (testing a different underlying on each thread) is beneficial for seek times despite HDDs only being able to read one file at a time? I believe I read somewhere that the OS will optimize queued read requests to minimize total seek time.

2) I understand that seek time is nonexistent for SSDs. I can probably move everything except for the tick data onto one. In addition to seek time improvement, can multiple threads read separate files off an SSD in parallel?

3) I'm not all that familiar with binary files but I understand they can be read faster than parsing text. However, read times haven't really been much of an issue. Are there other advantages to the "binary file store"? I honestly don't even really know what that means...is it simply creating/saving binary files the same way I have my csv files?

jtrader33 · Feb 6, 2015

I hear you on the lazy bit, I'm definitely not wanting to be that guy. My question on binary files was too broad - I have read up on them in general as well as the tea files site you mentioned. Perhaps this is a more clear depiction of my testing and resulting bottlenecks (apologies if I'm not using the right terminology):

Seek time (finding a file on disk):
This appears to be the greatest bottleneck by far. I say that because an initial read of a file takes a multiple of the processing time for subsequent reads. I assume this is because the drive head is left in place so no additional seeking is required. Having said that, I've also noticed that reading other files and then coming back to read the original file is also much faster than the initial read...which is a bit confusing. I could just be getting lucky that the other files are located in close proximity to the original file, but I doubt it. Perhaps the OS caches the file location and can find it faster subsequently? No idea if that makes sense.

Essentially, I'm looking for ways to address this. My ideas so far:

-Was using //sym//sym_date.csv. Tons of small files doesn't appear to be a real smart choice since it's seek time x 3200[days] for each symbol. So I should probably move to sym.csv instead and use binary searches where appropriate.

-Put whatever I can on SSD

-Appreciate your comment on loading data to memory. I've done that before (with help from this site) but it was a case where the data set fit into RAM. Here it won't and my backtest executes much faster than I can read from disk. Is there still a benefit to doing this since I'll be waiting on reads to complete anyway?

-What I should have asked originally about the binary file store is does it improve seek (not read) times in any way? I haven't read anything that suggests it would but after reading your description, it felt like I might be missing something.

Read time (getting data out of the file once found):
I should not have said this isn't an issue, only that it gets dwarfed by seek time. What kind of improvement do you typically see over csv when reading binary files? 5% reduction in read time may not be worth it, whereas 30% would be a no-brainer.

jtrader33 · Feb 6, 2015

I ran some more tests and it's safe to say that I'm just making an ass out of myself now. I'm afraid to declare anything at this point, but in direct contradiction to my prior posts I think my issues in order are string parsing/object creation, file read time, then file seek time. Here are file read time results which I think demonstrate that assertion (3 runs for each test - completion time in milliseconds):

Symbol "A"
Method 1 [Read one large file from SSD and parse text -> create objects at each line]: 1899, 1899, 1875
Method 2 [Read one large file from HDD and parse text -> create objects at each line]: 1860, 1870, 1895
Method 3 [Read many small files from HDD and parse text -> create objects at each line]: 146590, 2259, 2061

Symbol "WLP"
Method 1 [Read one large file from SSD and parse text -> create objects at each line]: 2148, 2175, 2218
Method 2 [Read one large file from HDD and parse text -> create objects at each line]: 2098, 2120, 2184
Method 3 [Read many small files from HDD and parse text -> create objects at each line]: 2329, 2350, 2295

Symbol "WLP"
Method 1 [Read one large file from SSD and no object creation]: 247, 247, 248
Method 2 [Read one large file from HDD and no object creation]: 269, 272, 270
Method 3 [Read many small files from HDD and no object creation]: 456, 462, 461

Like I said, I'm wary of drawing any more conclusions here, but would it be possible to avoid the text parsing and object creation by putting my OptionQuote objects into an ArrayList and then storing that ArrayList to file as a serialized object?

Also, I have no idea why it takes forever to read many small files off the HDD (Symbol A, Method 3) the first time around, but that's been true any time I've looked at this. Not a big problem I guess, but strange.
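For anyone repeating this kind of measurement, a minimal Java sketch of timing the two stages separately might look like the following. The `StageTiming` class, the three-field `Quote` type, and the synthetic CSV are assumptions made for illustration; they are not the test code behind the numbers above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class StageTiming {
    // Hypothetical three-field quote; a real OptionQuote would carry more fields.
    static final class Quote {
        final long date; final double bid; final double ask;
        Quote(long date, double bid, double ask) { this.date = date; this.bid = bid; this.ask = ask; }
    }

    // Stage 2 for one CSV line: text parsing plus object creation.
    static Quote parseLine(String line) {
        String[] f = line.split(",");
        return new Quote(Long.parseLong(f[0]), Double.parseDouble(f[1]), Double.parseDouble(f[2]));
    }

    public static void main(String[] args) throws IOException {
        // Synthetic data so the sketch runs anywhere; point `csv` at a real file instead.
        Path csv = Files.createTempFile("quotes", ".csv");
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < 500_000; i++) sample.add((1423180800000L + i) + ",1.25,1.30");
        Files.write(csv, sample);

        long t0 = System.nanoTime();
        List<String> lines = Files.readAllLines(csv);          // stage 1: file to strings
        long t1 = System.nanoTime();
        List<Quote> quotes = new ArrayList<>(lines.size());
        for (String line : lines) quotes.add(parseLine(line)); // stage 2: parse + allocate
        long t2 = System.nanoTime();

        System.out.printf("read: %d ms, parse+objects: %d ms, quotes: %d%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, quotes.size());
        Files.delete(csv);
    }
}
```

Because the sample file has just been written, stage 1 here is almost certainly served from the OS cache. For a cold-read number, time a file that has not been touched recently.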

Butterfly · Feb 6, 2015

you are hopeless,

first you are taking technical advice from volpunter, someone who has been exposed several times for having no clue on programming design, coding and programming pattern, so you are already fucked right there.

Take a strong RDBMS and properly structure your tables with a database design tool (Oracle Data Modeler, for example; it's free). It will save you a lot of headaches when you "repeat" your different backtests. Once you have "structured" your data, all those silly hardware issues will be irrelevant.

I have an idea, from what I am already reading in this thread, that the above will be too difficult for you to accomplish. Maybe get some professional help, or else you will be stuck forever in Amateur Alley posting all kinds of silly questions on EliteTrader.

volpunter · Feb 6, 2015

That is exactly where binary files shine, provided you have an efficient serializer and deserializer. Re seek times, try defragmenting your hard drive.

volpunter · Feb 7, 2015

I usually manage one single file per symbol per quote type; such a file can be 50 KB or it can be 5 GB. But you can split it up into files by year or month.

Also, the reason binary files are much better is that, with fixed-size records, you do not have to read the entire file when you need only a small part of the data in it. With text-based files, however, you need to read the file from the start until you find the location where you want to read data from. On top of that, string parsing is extremely slow in most languages.
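A minimal Java sketch of that point, assuming fixed-size records: with a known record size, record n starts at byte n * size, so `RandomAccessFile.seek` can jump straight to it. The 16-byte layout (one long timestamp plus one double price) and the `FixedRecordFile` names are hypothetical, chosen only to keep the example short.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

final class FixedRecordFile {
    // Hypothetical record: long timestamp (8 bytes) + double price (8 bytes).
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES;

    // Writes `count` sample records: timestamp 1000 + i, price i * 0.5.
    static void writeSample(Path file, int count) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (int i = 0; i < count; i++) {
                out.writeLong(1000L + i);
                out.writeDouble(i * 0.5);
            }
        }
    }

    // Reads only the timestamp of record `index`: one seek plus an 8-byte read.
    static long timestampAt(Path file, long index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(index * RECORD_BYTES);
            return raf.readLong();
        }
    }

    // Reads only the price of record `index`, skipping the timestamp in front of it.
    static double priceAt(Path file, long index) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(index * RECORD_BYTES + Long.BYTES);
            return raf.readDouble();
        }
    }
}
```

The cost of `timestampAt` does not grow with the file size, which is what makes a binary search over a sorted file practical.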

John Tseng · Feb 7, 2015

We've gone through a few decades of optimizing disk access, so there's actually quite a bit of complexity here.

1) You are right that somewhere along the chain, disk accesses will be reordered to reduce overall seek time. The O/S is, however, not the best place to do it, since it doesn't know where the head is. http://en.wikipedia.org/wiki/Tagged_Command_Queuing is where the disk reorders the requests.

This, however, does not mean that you should throw multiple reads at the disk. Doing so guarantees that you will need to read from two different places in the disk, creating seek times. The best thing you can do for your disk access is to completely eliminate all seek times by ensuring that you read everything in order, from the beginning of the disk to the end. As volpunter has said, defrag will help here. Reading from only one file that is contiguous on the disk means that your disk never has to seek. It just reads from the beginning to the end.

If you MUST have random access, then multiple threads are better since you'll stack up a large number of requests in the disk. The seek times will be aggregated, and the average seek time will go down. So there, multiple threads are better.

There is also some caching here that you've experienced. The O/S will use all its spare (and sometimes not so spare) memory to cache files that you've recently accessed. This is why you saw that coming back to the same file is so much faster. The O/S didn't need to go to disk at all. It just grabbed it from memory. Postgres actually leverages this cache a lot. Unlike MSSQL, Postgres doesn't cache data in its own memory. It simply hopes that the O/S will cache it.

2) The technology behind SSDs is very different, but it's not magic. If you look through any SSD's specs, you'll notice two interesting numbers for reads: MB/s and IOPS. The IOPS figure is usually for 4K blocks in random order. You'll see that IOPS * 4K never matches up to the MB/s figure. That's because random access is more expensive than sequential access. In reality, ALL SSD reads have seek times, including sequential reads. It's just that one line of flash is very large, so we get a large amount of data (much more than 4K) in one seek. To get the maximum throughput out of your SSD, you again want to read sequentially.

There's a little twist here. If you MUST have random access anyway, then you DO want multiple threads just like with spinning disks. The reason, though, is slightly different. SSDs are more often than not arranged as multiple chips. Each chip can handle a number of simultaneous requests. If you can distribute your requests over all the chips, then you get some speedup. In fact, the IOPS speed you see on spec sheets is often at high queue depths (number of outstanding requests). If you want to achieve the high IOPS the manufacturer has promised, you must do so with a lot of random requests distributed all over in parallel. Large disks have more chips, and hence better IOPS.
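As a rough illustration of keeping several random reads outstanding at once, the Java sketch below issues 4 KB positional reads from a thread pool. `ParallelRandomReads`, the block size, and the "first byte of each block" return value are assumptions made for the example. `FileChannel.read(ByteBuffer, long)` is the standard positional read; it does not move the channel's shared position, so several threads can call it on the same channel.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelRandomReads {
    static final int BLOCK = 4096; // matches the 4K block size used in IOPS specs

    // Reads the requested block numbers concurrently and returns the first
    // byte of each block, in the same order as the request array.
    static byte[] firstBytes(Path file, long[] blocks, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            List<Future<Byte>> pending = new ArrayList<>();
            for (long b : blocks) {
                Callable<Byte> task = () -> {
                    ByteBuffer buf = ByteBuffer.allocate(BLOCK);
                    long pos = b * BLOCK;
                    // A positional read may return fewer bytes than asked for, so loop.
                    while (buf.hasRemaining()) {
                        int n = ch.read(buf, pos + buf.position());
                        if (n < 0) break; // end of file
                    }
                    return buf.get(0);
                };
                pending.add(pool.submit(task));
            }
            // Collect results before the channel is closed.
            byte[] out = new byte[blocks.length];
            for (int i = 0; i < out.length; i++) out[i] = pending.get(i).get();
            return out;
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this actually beats a single sequential pass depends on the drive and the access pattern, so it is worth timing both on real data.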

Binary File) Binary files significantly reduce the work that the CPU has to do to read files. Parsing a string means it must do a few operations for each character, often more than 100 instructions to read a single number. If the number were in the machine's native format, the CPU could simply copy the data into memory and call it a day: one load instruction. The binary form also has the added benefit of being much denser than a string representation, so you can read more numbers per second.

From your tests, it looks like your bottleneck is more in the parsing than the disk seeks, with the exception of the multiple "A" files where you're really thrashing your disk. You will seriously benefit from a binary file representation. Lower-level languages like C/C++ are really good at this, as you can simply copy your memory to disk and call it a day. If you use higher-level languages or want more portability, you'll unfortunately need to use a few more instructions to serialize into a network representation before storing to disk. Still, this is probably tens of instructions as opposed to hundreds.

jtrader33 · Feb 7, 2015

volpunter said:
Also the reason why binary files are much better is that you do not have to read the entire file even if you just need a small part of the data in the file. However, with text based files you need to read the file from start until you find the location where you want to peruse data from. Plus the fact that string parsing is extremely slow in most languages.

VP, appreciate you sharing your approach and being patient while I puke all over myself in this thread. As it turns out, putting my own clunky version of it together in Java was stupid simple. I think some of the terminology differences between languages had me thinking there was a lot more to it (as well as looking for things that don't exist in Java). Anyhow, the non-optimized first cut results are solid: average 320ms to read entire file and create objects. The best I could do on this symbol previously was 2100ms.

The catch here is that reading DataPackets that were stored as serialized objects is painfully slow in Java (~20s). However, writing primitive field values to binary files and then creating objects on the fly when reading those values performed much better.
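A minimal sketch of the primitive-field approach, assuming `DataOutputStream` and `DataInputStream` (the post does not say which classes were actually used). The field names in `OptionQuote` and the `QuoteCodec` helper are hypothetical; only the Long, Double, Long, Double, Char, Double, Double, Char sequence comes from the thread.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Field names are assumptions; only the type sequence is from the thread.
final class OptionQuote {
    final long date; final double underlying; final long expiry; final double strike;
    final char right; final double bid; final double ask; final char exchange;

    OptionQuote(long date, double underlying, long expiry, double strike,
                char right, double bid, double ask, char exchange) {
        this.date = date; this.underlying = underlying; this.expiry = expiry;
        this.strike = strike; this.right = right; this.bid = bid;
        this.ask = ask; this.exchange = exchange;
    }
}

final class QuoteCodec {
    // 2 longs + 4 doubles at 8 bytes each, 2 chars at 2 bytes each.
    static final int RECORD_BYTES = 52;

    // Writes each quote as raw primitive fields, with no object headers or class metadata.
    static void write(Path file, List<OptionQuote> quotes) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            for (OptionQuote q : quotes) {
                out.writeLong(q.date);   out.writeDouble(q.underlying);
                out.writeLong(q.expiry); out.writeDouble(q.strike);
                out.writeChar(q.right);  out.writeDouble(q.bid);
                out.writeDouble(q.ask);  out.writeChar(q.exchange);
            }
        }
    }

    // Reads the fields back in the same order and builds objects on the fly.
    static List<OptionQuote> read(Path file) throws IOException {
        int count = (int) (Files.size(file) / RECORD_BYTES);
        List<OptionQuote> quotes = new ArrayList<>(count);
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            for (int i = 0; i < count; i++) {
                quotes.add(new OptionQuote(in.readLong(), in.readDouble(), in.readLong(),
                        in.readDouble(), in.readChar(), in.readDouble(),
                        in.readDouble(), in.readChar()));
            }
        }
        return quotes;
    }
}
```

The buffered streams matter here: without them, every `readLong` or `writeDouble` call would go to the underlying file stream separately.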

As a result of storing field values, my binary files for option quotes are a repeating sequence of Long, Double, Long, Double, Char, Double, Double, Char. From what I can tell, the best alternative for conducting a binary search will be to read the file into a byte array and then conduct the binary search on array elements where index % 8 == 0 (using date as an example). Is reading the file into a byte array essentially what you're doing as well before conducting the binary search?
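One possible shape for that search, sketched under the assumption that the file uses `DataOutputStream`-style encoding (8-byte longs and doubles, 2-byte chars, big-endian). Under that assumption each record is 52 bytes, so in a raw byte array the date of record i sits at byte offset i * 52. The `index % 8 == 0` rule would apply only if the array were indexed by field rather than by byte. `QuoteSearch` and its sample data are hypothetical.

```java
import java.nio.ByteBuffer;

final class QuoteSearch {
    // Long, Double, Long, Double, Char, Double, Double, Char = 52 bytes.
    static final int RECORD_BYTES = 2 * Long.BYTES + 4 * Double.BYTES + 2 * Character.BYTES;

    // Returns the index of the first record whose leading long (the date)
    // is >= target, or the record count if every date is smaller.
    // Assumes records are sorted by date in ascending order.
    static int lowerBound(byte[] file, long target) {
        ByteBuffer buf = ByteBuffer.wrap(file); // big-endian by default
        int lo = 0;
        int hi = file.length / RECORD_BYTES;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            long date = buf.getLong(mid * RECORD_BYTES); // absolute read, no copying
            if (date < target) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    // Builds n sample records with dates 0, 10, 20, ...; other fields are filler.
    static byte[] sample(int n) {
        ByteBuffer buf = ByteBuffer.allocate(n * RECORD_BYTES);
        for (int i = 0; i < n; i++) {
            buf.putLong(i * 10L).putDouble(0).putLong(0).putDouble(0)
               .putChar('C').putDouble(0).putDouble(0).putChar('X');
        }
        return buf.array();
    }
}
```

For files too large to hold in a byte array, the same loop works against a `RandomAccessFile` or a memory-mapped buffer, reading 8 bytes at each probe instead of loading everything first.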

jtrader33 · Feb 7, 2015

John Tseng said:
The best thing you can do for your disk access is to completely eliminate all seek times by ensuring that you read everything in order, from the beginning of the disk to the end. As volpunter has said, defrag will help here. Reading from only one file that is contiguous on the disk means that your disk never has to seek. It just reads from the beginning to the end.

John, thanks very much for taking the time with that post. Very informative. For sure, I will be reading it multiple times to let it all sink in.

Initially, I thought you were making a theoretical point in the above quote, but after reading on it seems like it may be a practical endeavor. There's no mystery to the order in which files are read in my backtests and it seldom changes. Am I oversimplifying things if I ask whether there's a way to instruct the HDD to organize files in a certain sequence?