Large sorting in R

spacewiz · Aug 7, 2015

volpunter said:
I fully agree. The problem here is not data storage, it is efficient computations (sorting)...SQL should not even be on the list for anything time-series related anyway...

Could you explain why you are making this statement regarding SQL and time series? I'm interested to learn about specific problems. I would definitely agree if we were talking about processing streaming data, but from R1234's question it sounds like he is just working on a static historical data set and asking about sorting data that could be stored in a single table...

In my opinion a database would be more scalable than the Excel- or R-based solution that he was talking about, and still pretty easy to set up. Definitely no more difficult than R.

spacewiz · Aug 7, 2015

i960 said:
Technically SQL is just a standard for querying. The actual storage of data can be implemented in a multitude of ways. There are plenty of non-SQL and SQL-frontended storage libraries that could also be used (e.g. BDB), allowing the programmer to have small record overhead and fast access. At the end of the day, it's all data.

Agreed. Other examples are Hadoop Hive and Amazon Redshift.

volpunter · Aug 7, 2015

...definitely more time consuming and costly than R. The reason I said that SQL solutions do not lend themselves well to time-series data is the relational nature of SQL-like solutions. There is a myriad of information on the net about this very topic. You should consider columnar databases or binary data stores to handle time series. SQL-based database solutions were not designed to handle time-series data well.

There are some R packages that are highly proficient in storing, retrieving, and querying time-series data. There are also memory-mapped solutions, whether in-memory databases or file based.

I can assure you that for a proof of concept I will most likely design something around R in 1/10th the time it would take you with an SQL solution. Not because of our different skill sets, but because in R all it takes is importing a package, done. Just installing SQL takes a significant amount of time.

spacewiz said:
Could you explain why you are making this statement regarding SQL and time series? I'm interested to learn about specific problems. I would definitely agree if we were talking about processing streaming data, but from R1234's question it sounds like he is just working on a static historical data set and asking about sorting data that could be stored in a single table...

In my opinion a database would be more scalable than the Excel- or R-based solution that he was talking about, and still pretty easy to set up. Definitely no more difficult than R.

globalarbtrader · Aug 7, 2015

R1234 said:
A lot of my current strategies are cross-sectional in nature. They rank across a universe of stocks for several factors daily, then take a weighted average of the various factor ranks to come up with a final score for each stock.

I do this for a universe of several hundred stocks daily in Excel/VBA without too many problems (but it's not lightning fast).

Now I need to do the same across several thousand stocks globally and my Excel/VBA framework chokes on the data. The historical daily data goes back to the 1990s and there are about 3,500 data series. So it's a pretty big dataset.

Does this sound like something that the R framework can handle with ease or will I be up against similar issues with choking and memory hogging?

I know there's a steep learning curve with R and I don't want to learn it unless I think it will be useful in this type of research.

Thanks for any insights you can give me...

So you don't know R right now, but are thinking about learning? My two cents worth is I wouldn't use R for this kind of application, and I'd spend my time learning something else.

Although in theory there is no issue, I've found there are problems with the way R allocates and then doesn't free up memory, particularly when running over loops. You need to write your code very carefully to avoid this kind of problem. As a relative novice you'll potentially waste a lot of time learning how to do this.

Where I used to work both Matlab and Python were used successfully in the implementation of exactly the kind of system you have. I have a well-known preference for Python so I won't trot out the pros and cons once again. Other languages are of course available.
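To give a feel for the size of the job, here is a minimal sketch of the ranking step in plain Python. It is only an illustration under some assumptions: the factor names, the weights, and the choice of rank 1 for the largest value (with ties sharing the average rank) are all made up for the example, and every factor is assumed to cover the same tickers.

```python
def average_ranks(values):
    """Map ticker -> rank, where rank 1 is the largest value.

    Tied values share the average of the positions they occupy.
    """
    ordered = sorted(values, key=values.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover every ticker tied with the one at position i.
        while j + 1 < len(ordered) and values[ordered[j + 1]] == values[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def composite_scores(factors, weights):
    """Weighted average of per-factor ranks for one day.

    factors: {factor_name: {ticker: raw_value}} (hypothetical layout)
    weights: {factor_name: weight}
    Assumes every factor has a value for every ticker.
    """
    total_weight = sum(weights.values())
    ranks = {name: average_ranks(vals) for name, vals in factors.items()}
    tickers = next(iter(factors.values())).keys()
    return {
        t: sum(weights[name] * ranks[name][t] for name in factors) / total_weight
        for t in tickers
    }
```

With this convention a lower composite score means a better overall rank. A few thousand tickers per day is a small input for a sort like this, so the work per day should be modest in any of the languages mentioned in this thread.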

I'd personally use a simple database for storage rather than say flat files, although as others have said this is more for robustness than to get more memory back.

One more thing: backtesting this beast could be done very easily as a parallel process (since today's cross-sectional (CS) ranking has no bearing on tomorrow's). I guess in a world where most people have multiple cores on one machine some parallelisation would be done by the interpreter or compiler (I'm not an expert on this) but you still have to write your code in such a way as to make it possible (e.g. list comprehensions in Python, or the equivalent). Alternatively we used to run stuff on a big cluster where we had to make the parallel stuff explicit (using something like http://www.parallelpython.com/).
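A rough sketch of the explicit version, using only Python's standard library rather than Parallel Python. As far as I know, CPython will not spread a list comprehension across cores by itself, so the split by date has to be written out. The data layout ({date: {ticker: score}}) and the function names are assumptions for the example.

```python
from concurrent.futures import ProcessPoolExecutor


def rank_one_day(scores):
    """Rank one day's cross-section: 1 = highest score, ties broken by ticker."""
    ordered = sorted(scores, key=lambda ticker: (-scores[ticker], ticker))
    return {ticker: pos for pos, ticker in enumerate(ordered, start=1)}


def rank_all_days(panel, workers=1):
    """Rank every date independently.

    panel: {date: {ticker: score}} (hypothetical layout)
    workers: 1 runs serially; more than 1 uses a process pool, which is safe
    here because no day depends on any other day.
    """
    dates = sorted(panel)
    if workers <= 1:
        results = [rank_one_day(panel[d]) for d in dates]
    else:
        with ProcessPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(rank_one_day, [panel[d] for d in dates]))
    return dict(zip(dates, results))
```

When running with workers greater than 1 as a script, call rank_all_days from under an `if __name__ == "__main__":` guard so that worker processes can import the module cleanly. For a job this small per day, the cost of shipping data to the workers may outweigh the gain, so it is worth timing both paths.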

Another option could be using something like https://www.quantconnect.com/ (no connection, never used it, but looks interesting) to get parallel computing power.

GAT

spacewiz · Aug 7, 2015

volpunter said:
Just to install SQL takes a significant amount of time.

volpunter, sounds interesting, also sounds like your experience with database systems comes from large enterprise environments... Installing MySQL on my laptop took about as much time as installing R itself... and you don't have to find and install additional packages to create a simple flat table, load the data, and start querying it... You do have to give some thought to how to index it though... So I'm not sure I agree with the statement that the SQL approach takes more time to set up.

But I do get your point regarding problems of SQL dealing with time series in large complex system environments, esp. where you have to deal with streaming data in low latency scenarios... I just don't think we are talking about anything that complicated for this particular example.
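To make the "simple flat table" point concrete, here is a minimal sketch. It uses SQLite through Python's built-in sqlite3 module instead of MySQL, purely so the example runs without installing anything; the table name, column names, and index are assumptions for illustration.

```python
import sqlite3


def build_daily_table(rows):
    """Create an in-memory flat table and load (trade_date, ticker, factor) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily (trade_date TEXT, ticker TEXT, factor REAL)"
    )
    conn.executemany("INSERT INTO daily VALUES (?, ?, ?)", rows)
    # Index chosen for the query below: filter by date, then order by factor.
    conn.execute(
        "CREATE INDEX idx_daily_date_factor ON daily (trade_date, factor)"
    )
    conn.commit()
    return conn


def sorted_for_date(conn, trade_date):
    """Return one day's (ticker, factor) pairs, highest factor first."""
    cursor = conn.execute(
        "SELECT ticker, factor FROM daily "
        "WHERE trade_date = ? ORDER BY factor DESC",
        (trade_date,),
    )
    return cursor.fetchall()
```

The sorting itself is just an ORDER BY. The indexing decision mentioned above shows up as the single CREATE INDEX line, and a different query pattern would call for a different index.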

volpunter · Aug 7, 2015

Agree with many of your points, and precisely because OP's problem does not seem overly complex, I recommended R. R lends itself to quick ranking of metrics. Python can get the job done, and so can an implementation in any other language. I recommended R because the packages are there and one needs a 1-line command to provide access to the right package.

spacewiz said:
volpunter, sounds interesting, also sounds like your experience with database systems comes from large enterprise environments... Installing MySQL on my laptop took about as much time as installing R itself... and you don't have to find and install additional packages to create a simple flat table, load the data, and start querying it... You do have to give some thought to how to index it though... So I'm not sure I agree with the statement that the SQL approach takes more time to set up.

But I do get your point regarding problems of SQL dealing with time series in large complex system environments, esp. where you have to deal with streaming data in low latency scenarios... I just don't think we are talking about anything that complicated for this particular example.

blah12345678 · Aug 7, 2015

For what it's worth, I use both spacewiz's and i960's solutions:

- PostgreSQL with heavily indexed tables
- SQL for simple, frequently used queries
- suck all the data into Perl when I need to perform computationally intensive calculations. s/Perl/R|Python|C|C#/ depending on your preferences....

Use the solution that's best for the task. Multiple solutions for multiple tasks, if need be. Pragmatic people are not zealots... They prefer to get shit done instead of preaching that their gospel is the one true religion....

R1234 · Aug 8, 2015

globalarbtrader said:
So you don't know R right now, but are thinking about learning? My two cents worth is I wouldn't use R for this kind of application, and I'd spend my time learning something else.

Although in theory there is no issue, I've found there are problems with the way R allocates and then doesn't free up memory, particularly when running over loops. You need to write your code very carefully to avoid this kind of problem. As a relative novice you'll potentially waste a lot of time learning how to do this.

Where I used to work both Matlab and Python were used successfully in the implementation of exactly the kind of system you have. I have a well-known preference for Python so I won't trot out the pros and cons once again. Other languages are of course available.

I'd personally use a simple database for storage rather than say flat files, although as others have said this is more for robustness than to get more memory back.

One more thing: backtesting this beast could be done very easily as a parallel process (since today's cross-sectional (CS) ranking has no bearing on tomorrow's). I guess in a world where most people have multiple cores on one machine some parallelisation would be done by the interpreter or compiler (I'm not an expert on this) but you still have to write your code in such a way as to make it possible (e.g. list comprehensions in Python, or the equivalent). Alternatively we used to run stuff on a big cluster where we had to make the parallel stuff explicit (using something like http://www.parallelpython.com/).

Another option could be using something like https://www.quantconnect.com/ (no connection, never used it, but looks interesting) to get parallel computing power.

GAT

I just read your post and thought maybe you had read my mind! I hired an R programmer last week to write me the program. He will be using parallel processing in his code.

volpunter · Aug 8, 2015

That was your most valuable takeaway from the past 3 pages? Confused (but happy you got the answer you were seeking).

R1234 said:
I just read your post and thought maybe you had read my mind! I hired an R programmer last week to write me the program. He will be using parallel processing in his code.

i960 · Aug 8, 2015

Multithreaded code will potentially make things faster but you should be telling said programmer to change the algorithm - that's where the real optimization is. Then again I don't totally know the entire problem space you're trying to solve. It could be other things unrelated to sorting.
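One illustration of what an algorithm change could look like, written in Python. This rests on an assumption about the problem that nobody in the thread has confirmed: that only the top N names per day are actually needed. If so, a partial selection avoids sorting the whole universe. The function name and data layout are made up for the example.

```python
import heapq


def top_n(scores, n):
    """Return the n tickers with the highest scores, best first.

    scores: {ticker: score} (hypothetical layout)
    heapq.nlargest keeps only n candidates at a time, so the cost is roughly
    O(len(scores) * log n) rather than the O(len(scores) * log len(scores))
    of a full sort.
    """
    return heapq.nlargest(n, scores, key=scores.get)
```

If the full ranking really is needed for every stock, this does not apply, and the gains would have to come from somewhere else.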