What are the best tools to analyze data programmatically?

benvasseghi · Jun 21, 2020

Can anyone help me find a software programmer who specializes in finance or the trading world?
I would greatly appreciate it.

d08 · Jun 21, 2020

931 said:
Over the years I have bodged together a C++ program that may solve many of the tasks you are likely to face.
I could implement your specific idea with the pre-existing codebase, or clean it up and sell it.
It offers high flexibility if a custom solution is needed, as almost all functions reinvent the wheel, and good speed thanks to multithreaded file loading, data filtering, testing, etc.
Communication with outside software is over socket connections; the GUI is based on Qt and graphing on OpenGL.
I started the project on Windows 6-7 years ago; at the moment it is ported to Linux due to the icc compiler's speed advantages and better CUDA support.
Currently I am working on binary caching to disk using lz4 compression, to support quickly swapping in thousands of low-timeframe stocks from random locations at higher than disk speed.
There is still a lot to do, more ideas for improvements than time.

But I don't offer it for free, and a lot of the code is not very readable or documented.
I think in systematic trading most code goes into the surrounding stuff that supports it; the core algo might easily be less than 1-5% of the codebase.
But timewise you could be spending most on the core.

I'm on a somewhat similar venture myself. I'm curious whether you used Qt's native charting capabilities with OpenGL or something custom?

Similarly, I've used threading whenever possible, but that of course comes with its own headaches (race conditions).

I'm using Python myself, so while slower, it takes advantage of most of the same technologies plus others like Pandas, and in certain situations it can be as fast or faster.

931 · Jun 22, 2020

Yes, many Python libraries are probably well-optimized C code, especially the AI stuff.

Before Qt I actually started my program in Visual Studio as a simple OpenGL-based stock market simulator that loads a single historical data file and plays it.
The idea was to make trades and learn on historical data that is sped up 5000x or whatever multiplier.
Entry and exit were the up/down/right arrow keys, like a game.

Programming strategies and testing them on history seemed a better plan, and I also ported to the Qt IDE as it had great libraries, documentation and much better/faster autocompletion compared to Visual Studio.

I decided to keep the lower-level approach instead of Qt Charts or another charting library to have more flexibility.
Lower-level code has fewer plotting limitations, for example smooth zooming and sliding of the stock data without stepping, rendering various algos' debugging info, or drawing order entries etc. on top of the chart.
Also, I display bid/ask high/low bars at the same time with transparency on top of each other, because I believe bid is only half the truth. Idk if Qt Charts would support this.
Basically anything that can be made out of colored or textured triangles is plottable.
Since I have a gamedev background, maybe one day I will decide to make a snake game on top of the chart to eat the stock data and reduce stress. It might not be possible with Qt Charts.
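
To make the "anything made of triangles" idea concrete, here is a minimal Python sketch (not 931's actual C++ code) that turns one bar's high/low range into two triangles and builds separate translucent bid and ask layers. The function names, bar width and RGBA colours are all hypothetical choices for illustration; a real renderer would upload these vertices to OpenGL with alpha blending enabled.

```python
def bar_to_triangles(x, low, high, width=0.8):
    """Return six (x, y) vertices: two triangles covering a bar's low..high range."""
    left, right = x - width / 2, x + width / 2
    bottom_left, bottom_right = (left, low), (right, low)
    top_left, top_right = (left, high), (right, high)
    # A rectangle split along its diagonal into two triangles.
    return [bottom_left, bottom_right, top_right,
            bottom_left, top_right, top_left]


def bid_ask_layers(x, bid_low, bid_high, ask_low, ask_high):
    """Build one translucent layer per side so both stay visible when overlaid."""
    return [
        # Alpha of 0.5 is an arbitrary illustrative value.
        {"rgba": (0.2, 0.4, 1.0, 0.5),
         "vertices": bar_to_triangles(x, bid_low, bid_high)},
        {"rgba": (1.0, 0.4, 0.2, 0.5),
         "vertices": bar_to_triangles(x, ask_low, ask_high)},
    ]


layers = bid_ask_layers(0, 99.0, 101.0, 99.2, 101.3)
```

Anything else drawn on the chart (order markers, debug overlays) would just be more vertex lists of the same shape.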

But so far I don't even have price levels plotted on the side, as I don't care about levels too much, more about ratios and patterns.
It shows bar lo/hi/open/close if the mouse is on a bar...

Qt's native charting capabilities are nice; I used those as well for statistics about various things like avg spread over times/weekdays, loaded data stats, etc.
I am impressed by the Qt 3D charting sample projects, very nice looking 3D charting; I will probably use those soon for some data that benefits from more dimensions.

Race conditions will not be as big a problem once you have worked long enough with multithreading.
At the beginning it's a pain, but at some point you start to understand how things operate in parallel, where those conditions will occur so they can be avoided, and what techniques are available.
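
One of the simplest of those techniques is guarding shared state with a lock. The sketch below is in Python, since that is what d08 uses; the `FillCounter` class and `run_workers` helper are hypothetical names made up for this example. Without the lock, the read-modify-write in `add` could interleave between threads and lose updates.

```python
import threading


class FillCounter:
    """Shared counter; the lock makes each increment atomic."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def add(self, n=1):
        # Only one thread at a time may run the read-modify-write below.
        with self._lock:
            self.count += n


def run_workers(counter, workers=4, increments=10_000):
    """Start several threads that all update the same counter."""
    def work():
        for _ in range(increments):
            counter.add()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.count


total = run_workers(FillCounter())  # 4 workers * 10_000 increments
```

Keeping the locked section as small as possible matters, because every thread waiting on the lock is effectively running serially.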

Also, modern CPUs with AVX-512 support can speed some parts of the code up significantly without multithreading.
Basically, the CPU applies one instruction to several values at once, for example 16 single-precision floats per 512-bit register.

I would not be surprised if in some cases optimized vectorized code is faster on a single thread than unvectorized code on 6 or more threads.

Combining threading and vectorizing could give a great speedup.

Fewer cache misses are also important: it takes the CPU longer to get data from RAM than from its faster internal caches.

d08 · Jun 22, 2020

I'm using pyqtgraph myself because the native QtChart seemed half-done and slow, and it doesn't appear to have a lot of users. Of course, certain things in pyqtgraph also needed massive improvement: data needed to be handled on the fly instead of loading it all into a chart and doing zoom operations on it (as is standard).
Similarly to your setup, I implemented a "bar info box" and other widgets. I have not gotten to the point of OpenGL charts yet, as that requires a complete rewrite.
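
A rough sketch of what handling data on the fly can look like: only the points inside the visible range are handed to the chart, thinned to a fixed budget. This is plain Python and not pyqtgraph's API; `visible_slice` and `max_points` are hypothetical names, and the timestamps are assumed to be sorted.

```python
from bisect import bisect_left, bisect_right


def visible_slice(times, values, t_min, t_max, max_points=2000):
    """Return the points inside [t_min, t_max], thinned to at most max_points."""
    lo = bisect_left(times, t_min)    # first index with time >= t_min
    hi = bisect_right(times, t_max)   # one past the last index with time <= t_max
    n = hi - lo
    step = max(1, -(-n // max_points))  # ceiling division
    return times[lo:hi:step], values[lo:hi:step]


times = list(range(100_000))
values = [t * 0.01 for t in times]
xs, ys = visible_slice(times, values, 10_000, 19_999, max_points=500)
```

Plain striding like this can skip individual highs and lows, so for bar data a per-bucket min/max aggregation would probably be the better choice.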

Vectorizing is great, but I underestimated the memory requirements, and packages such as Modin that parallelize Pandas (relying on Dask or Ray) still run into problems, at least on Windows, because some essential Pandas functions are not implemented. I spent a great deal of time trying to get vectorized backtesting to work. It's still primitive, as it obviously cannot do everything that serial operations can.
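
For readers unfamiliar with the term, a vectorized backtest computes signals and returns for the whole price array at once instead of looping bar by bar. The sketch below is a toy moving-average crossover using NumPy (a third-party dependency); the strategy, window lengths and function names are assumptions for illustration, not d08's system.

```python
import numpy as np


def sma(prices, window):
    """Simple moving average; the first window-1 entries are NaN."""
    out = np.full(prices.shape, np.nan)
    csum = np.cumsum(prices)
    # Window sum ending at i is csum[i] - csum[i - window].
    shifted = np.concatenate(([0.0], csum[:-window]))
    out[window - 1:] = (csum[window - 1:] - shifted) / window
    return out


def crossover_backtest(prices, fast=3, slow=5):
    """Per-bar strategy returns for a long-only fast/slow MA crossover."""
    prices = np.asarray(prices, dtype=float)
    fast_ma, slow_ma = sma(prices, fast), sma(prices, slow)
    with np.errstate(invalid="ignore"):
        # Comparisons involving NaN are False, so warm-up bars stay flat.
        signal = (fast_ma > slow_ma).astype(float)
    # Trade on the bar after the signal to avoid look-ahead bias.
    position = np.concatenate(([0.0], signal[:-1]))
    returns = np.concatenate(([0.0], prices[1:] / prices[:-1] - 1.0))
    return position * returns


prices = np.linspace(100.0, 110.0, 11)  # steadily rising toy series
strategy_returns = crossover_backtest(prices)
```

Every intermediate here (`fast_ma`, `slow_ma`, `signal`, `position`, `returns`) is a full-length array, which is exactly where the extra memory goes compared with a serial loop.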

Ideally, eventually, I'd like to build a box with a significant amount of RAM and a Threadripper CPU with plenty of cores, so that backtesting would operate in RAM. I realize that using GPU power for number crunching might be more efficient, but my proficiency in CUDA is very low. I also suppose that with the new generation of AMD processors, CPUs have gained a lot of ground, if not moved ahead. Your thoughts?

userque · Jun 22, 2020

d08 said:
Ideally, eventually, I'd like to build a box with a significant amount of RAM and a Threadripper CPU with plenty of cores, so that backtesting would operate in RAM.

That seems odd that back-testing can't fit in RAM.

I use minute data going back to the late 90's, and it only eats a few hundred MB's. (Win 10, C#)

d08 · Jun 22, 2020

userque said:
That seems odd that back-testing can't fit in RAM.

I use minute data going back to the late 90's, and it only eats a few hundred MB's. (Win 10, C#)

Vectorized, not serialized. All data is held in RAM, no looping. Without doubt I have some inefficiencies but even on daily data, you're talking gigabytes.

931 · Jun 22, 2020

From what I have been reading, Qt with Python is a bit experimental and has a smaller user base.
If the Qt classes of Python and C++ were easily combinable, I would probably start using Python for all prototyping and many other tasks.

Another option for GPU is OpenACC, but learning CUDA can give better results.

A workstation PC is a good idea, and it could use 512 GB or more of RAM to load in many stocks at once for quick access.
If you need lots of data, one option is disk buffering on an NVMe SSD in an M.2 slot, as it can be more than 8x faster than SATA; some M.2 SSD models claim read speeds of 5000 MB/s and 3 TB+ capacity.
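
As a rough illustration of disk buffering combined with the compressed binary cache mentioned earlier in the thread, here is a small Python sketch. It uses `zlib` from the standard library purely as a stand-in for lz4 (which is much faster but is a separate package), and the file name and helper names are hypothetical.

```python
import os
import tempfile
import zlib
from array import array


def save_cache(path, closes):
    """Store prices as compressed 32-bit floats."""
    raw = array("f", closes).tobytes()
    with open(path, "wb") as fh:
        fh.write(zlib.compress(raw, 1))  # level 1 favours speed over ratio


def load_cache(path):
    """Read the compressed file back into an array of 32-bit floats."""
    with open(path, "rb") as fh:
        raw = zlib.decompress(fh.read())
    out = array("f")
    out.frombytes(raw)
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "XYZ_daily.bin")  # hypothetical symbol
    save_cache(path, [10.0, 10.5, 11.25])
    restored = list(load_cache(path))
```

The "higher than disk speed" effect comes from reading fewer bytes off the disk: if decompression is faster than the drive, the effective throughput of the uncompressed data exceeds the raw read rate.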

userque said:
That seems odd that back-testing can't fit in RAM.

I use minute data going back to the late 90's, and it only eats a few hundred MB's. (Win 10, C#)

That seems similar to my usage if using a float data type and bid prices only.

d08 · Jun 22, 2020

931 said:
Are you loading in single stock with daily data or hundreds/thousands of stocks?

Thousands of stocks; I tend to operate on the whole "universe" or at least over 90% of it. The important part in my use case is peak memory, which can be very high. In Pandas, many functions make copies of the whole data set, so memory usage ballooning is to be expected. I'm sure optimizing the code for months on end is possible, but I'm thinking that in the end it might be more efficient to just build a workstation to handle it all.

When dropping to lower timescales, I'll likely have to use Dask, which processes a dataframe as segments. Yet some of the functions from Pandas I use are not yet implemented. The ideal would be to store data in Parquet or Feather, load into memory if able, filter out unused data at different steps during the backtest and process results. Or just serialize everything and sleep soundly...
But like you said, there are so many ideas but so little time.
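
The segment-at-a-time idea can be sketched with the standard library alone. This is not Dask's API, just the general pattern: read a bounded chunk, filter out unused symbols early, and keep only small running aggregates so peak memory stays near one chunk. The column names, symbols and helper names are made up for the example.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows, one segment at a time."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk


def mean_close_by_symbol(csv_file, wanted, chunk_size=2):
    """Average close per wanted symbol without holding the whole file."""
    totals, counts = {}, {}
    reader = csv.DictReader(csv_file)
    for chunk in iter_chunks(reader, chunk_size):
        for row in chunk:
            symbol = row["symbol"]
            if symbol not in wanted:  # drop unused data as early as possible
                continue
            totals[symbol] = totals.get(symbol, 0.0) + float(row["close"])
            counts[symbol] = counts.get(symbol, 0) + 1
    return {s: totals[s] / counts[s] for s in totals}


data = io.StringIO("symbol,close\nAAA,10\nBBB,20\nAAA,12\nCCC,5\nBBB,22\n")
result = mean_close_by_symbol(data, wanted={"AAA", "BBB"})
```

This only works directly for computations that can be expressed as running aggregates; anything needing the full history at once (a cross-sectional rank, for instance) needs a different decomposition, which is presumably where the missing functions hurt.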

931 · Jun 22, 2020

It makes sense to load many instruments if using the daily timeframe. 365 days * 10 years is 3650 data points, and stocks are not even active on all days.
256 or 512 GB of RAM could probably store 100k plus stocks in daily resolution.
But many of those might be penny stocks, and penny stocks have too high spreads in my opinion.
Idk how many stocks with low spreads there actually are.

d08 · Jun 22, 2020

931 said:
It makes sense to load many instruments if using daily timeframe. 365 days*10 years is 3650 data points but stocks are not active all days.
256 or 512 GB of RAM will probably store 50 000 plus stocks in daily resolution.

There's all sorts of overhead it seems. So it's something like 5000 stocks * 252 trading days per year * 10 years = 12.6 million data points if only looking at closes. Looking at OHLCV, it's already at 63 million.
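
The arithmetic above can be extended from data points to raw bytes. The sketch below assumes plain 8-byte or 4-byte floats and nothing else (no index, no timestamps, no temporary copies); the `raw_bytes` helper is a hypothetical name.

```python
def raw_bytes(stocks, days_per_year, years, fields, bytes_per_value):
    """Return (data points, bytes) for a dense array of prices."""
    points = stocks * days_per_year * years * fields
    return points, points * bytes_per_value


close_points, _ = raw_bytes(5000, 252, 10, fields=1, bytes_per_value=8)
ohlcv_points, ohlcv_f64 = raw_bytes(5000, 252, 10, fields=5, bytes_per_value=8)
_, ohlcv_f32 = raw_bytes(5000, 252, 10, fields=5, bytes_per_value=4)

# close_points == 12_600_000
# ohlcv_points == 63_000_000
# ohlcv_f64    == 504_000_000 bytes (about 0.5 GB as float64)
# ohlcv_f32    == 252_000_000 bytes (about 0.25 GB as float32)
```

So the raw OHLCV data is roughly half a gigabyte as float64; usage in the gigabytes would then come from the overhead and full-size intermediate copies discussed earlier in the thread rather than from the data itself.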