Hi Bob,
thanks for your response.
...You are saying quite a bit in a short amount of discussion...
It seems that much of what you are describing is actually expressing an opinion about what types of data you think is useful for creating a stock trading strategy.
I believe a trading strategy can be based only on the data that is available. The list above covers a lot of different angles and gives a good enough picture for creating a trading strategy with a 3-5 day period from buying to selling.
...Other strategy researchers I am sure will have other ideas about what input data they need as input to their backtesting and scans. ...
I believe that if we base our decisions on roughly the same data, the trading decisions are a matter of interpretation. They depend on money management too. So potentially there can be as many trading strategies as there are traders.
My idea is that everybody creates their own trading strategy. What is needed is an easy way to try and test different strategies. In the end it comes down to the input data you have. If you slice and dice the different independent sources of data enough, you can later play with more variables.
The problem with existing tools is that you cannot do this efficiently.
While this is indeed off topic, it is important too. You have to be able to analyse, interpret and visualize all of this data.
Also (in your full original post that I trimmed down quoted above), you discuss a specific analysis strategy, as well as some specific user interface.
All of this is a very wide discussion, where the topic I am interested in discussing is specifically the data handling, computation speed, and strategy backtesting (and scanning) capability part of this. That seems to be the main focus of this thread. It is the part of the thread that interests me.
So, into that topic...
I agree
It seems you are saying that what you do is to generate large quantities of data (the ~1,000 columns), then you query the data to do scans and backtests. Certainly this is an approach, but I think not the only approach to the task.
Yes, this is one approach. If you pre-generate ~1,000 columns in advance (the permutations you will need later), the next phase, the actual screening and back testing, is much easier.
It is not necessary to generate all possible permutations, just those that make sense later. If you have them pre-computed, this is the fastest way, performance-wise, to screen and back test later. Why? Because you pre-generate one time, but screen and back test many times. If you generate your data dynamically every time you screen and back test, the performance is not the best. In the end it comes down to the need for a better database engine specific to these kinds of tasks.
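To make the "pre-generate one time, screen many times" point concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not our actual engine: the column names (`sma_5` and so on), the three window lengths and the helper functions are all hypothetical, and a plain dictionary of lists stands in for a column store.

```python
from collections import deque

WINDOWS = (5, 10, 20)  # hypothetical subset of window lengths


def rolling_mean(values, window):
    """Return a list aligned with values; None until the window is full."""
    out, buf, total = [], deque(), 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # drop the value that left the window
        out.append(total / window if len(buf) == window else None)
    return out


def precompute(closes):
    """Build the derived columns once (the expensive step)."""
    columns = {"close": list(closes)}
    for w in WINDOWS:
        columns[f"sma_{w}"] = rolling_mean(closes, w)
    return columns


def screen(columns, left, right):
    """Return bar indexes where column `left` is above column `right`."""
    hits = []
    for i, (a, b) in enumerate(zip(columns[left], columns[right])):
        if a is not None and b is not None and a > b:
            hits.append(i)
    return hits


closes = [float(x) for x in range(1, 31)]        # synthetic rising series
columns = precompute(closes)                     # done one time
above_short = screen(columns, "close", "sma_5")  # cheap, repeated many times
above_long = screen(columns, "close", "sma_20")
```

Each call to `screen` only reads columns that already exist, so trying another idea costs a scan over stored values rather than a recomputation of every indicator.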
Let me say that something which has always helped me with coding the rules of trading strategies. I tell myself, think bar by bar like the decisions are being made in real time actual trading. Sometimes there are decisions to be made at multiple points in the bar. On the open, during the bar, after the close. Write the trading rules code to do processing in sequential order as if you were trading the strategy in real time. After all, the only goal of this is the ability to trade a strategy in real time. Any decision that actually can be made in real time trading can by definition also be made at a point in time in simulation of trading.
This is very true. I would add that when I have all the data sliced and diced with our software, I can go back in time and inspect, bar by bar, all the factors I pre-calculated for every trading day as they occurred.
Do this one day at a time. Move forward to the next day. I find that thinking in this way helps me to code complex strategies.
Absolutely - this is a very simple and effective way to analyse the history and that is how we are working too.
What I also needed, after analyzing the factors day by day, was to find similar events using the same factors, e.g. to make a screener with the factors as parameters. What is more, I needed a screener that works on all the available history. The screeners we analyzed work only on today's data. That is because the amount of data is too much to store for all the available history. The problem is that I cannot know whether the screening criteria work if I cannot go back in history to see what happened next. This is different from back testing. In back testing you test a specific strategy after you have formulated it. I consider screening to be the phase where you are not yet committed to a specific strategy but are still researching to formulate a better one. With back testing you can then test it. The two have to help each other, and it should be easy to switch between screening and back testing.
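A screener over the whole history, as opposed to one over today's data, can be sketched roughly like this. Everything here is an assumption for illustration: the factor (a one-day percent change), the threshold, the 5-bar horizon and the function names are hypothetical and are not taken from our software.

```python
def forward_return(closes, i, horizon):
    """Percent change from bar i to bar i + horizon, or None if history ends."""
    j = i + horizon
    if j >= len(closes):
        return None
    return (closes[j] - closes[i]) / closes[i] * 100.0


def screen_history(closes, factor, threshold, horizon=5):
    """Find every bar where factor >= threshold and report what happened next."""
    events = []
    for i, value in enumerate(factor):
        if value is not None and value >= threshold:
            events.append((i, forward_return(closes, i, horizon)))
    return events


def hit_rate(events, target_gain):
    """Share of events with a known outcome that reached the target gain."""
    known = [r for _, r in events if r is not None]
    if not known:
        return None
    return sum(1 for r in known if r >= target_gain) / len(known)


closes = [100, 101, 99, 104, 106, 107, 110, 108, 111, 115, 114, 118]
# Hypothetical factor: one-day percent change, None for the first bar.
factor = [None] + [
    (closes[i] - closes[i - 1]) / closes[i - 1] * 100.0
    for i in range(1, len(closes))
]
events = screen_history(closes, factor, threshold=2.0, horizon=5)
rate = hit_rate(events, target_gain=5.0)
```

The point of the sketch is that each match carries its own "what happened next" value, so the criteria can be judged before any strategy is formulated. Matches too close to the end of the history simply have no outcome yet.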
Write code to process only the specific data needed for a specific backtest or scan (or write code to process lots of various data with user interface options enabling processing of a subset of data for a specific backtest, similar to your "cabd", 5% Gain 5 Days after D-Day" options). Write code to move through the data forward from first date to last date as if you were actually trading. With this approach, much of your 1,000 columns of data could be dynamically created on the fly, processed, discarded. Then move on to the next bar. This would greatly reduce the quantity of stored data.
I agree, the only difference is that our approach is data driven. I don't want to write code for each strategy I am testing. It is much more flexible and more productive if the end user just picks, for example with check boxes, which pre-calculated data he will use and formulates the relationships between these data. It is not practical or efficient to use all of these ~1,000 columns in a specific screener or back test, but they are enough to pick from to test one idea or another. The trick is to compute just enough permutations, not all of them. This is done one time but used many times, and only part of this data is used at any one time. The point of the discussion about SQL was that SQL prevents data from being organized and used effectively in this way.
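As a rough illustration of what "data driven" means here, a strategy can be expressed as a list of rules over pre-calculated columns instead of as code. This is a sketch under assumptions: the rule format `(left, op, right)`, the column names and the sample values are hypothetical, and a real user interface would build the rule list from check boxes rather than from literals.

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


def resolve(columns, operand, i):
    """An operand is either a column name or a literal number."""
    if isinstance(operand, str):
        return columns[operand][i]
    return operand


def run_rules(columns, rules):
    """Return bar indexes where every (left, op, right) rule holds."""
    n = len(next(iter(columns.values())))
    hits = []
    for i in range(n):
        ok = True
        for left, op, right in rules:
            a, b = resolve(columns, left, i), resolve(columns, right, i)
            if a is None or b is None or not OPS[op](a, b):
                ok = False
                break
        if ok:
            hits.append(i)
    return hits


# Hypothetical pre-calculated columns for one instrument.
columns = {
    "close":   [10.0, 11.0, 12.0, 11.5, 13.0, 14.0],
    "sma_3":   [None, None, 11.0, 11.5, 12.17, 12.83],
    "rel_vol": [1.0, 0.8, 1.6, 2.1, 0.9, 1.7],
}
# The "strategy" is just data: no new code per idea.
rules = [("close", ">", "sma_3"), ("rel_vol", ">=", 1.5)]
hits = run_rules(columns, rules)
```

Only the columns named in the rules are touched, which matches the idea that a given screener uses a small part of the ~1,000 columns at any one time.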
...enabling processing of a subset of data for a specific backtest, similar to your "cabd", 5% Gain 5 Days after D-Day" options)...
What I am describing is how most backtesting software does it.
I believe a screener and a back tester working together on much more data, as I described, is a better idea.
So the alternative approach might be to dynamically calculate only the data desired for the specific backtest or scan bar by bar as I describe above (as well as some of the fundamental data you mention will come from the disk). Move forward, calculate much of this input data on the fly, consume the calculated data, discard calculated data you are finished with, move on to the next bar. This is how I have approached this type of task, and it works very well and with very good performance.
I agree it is a fast and effective approach for a small amount of data. If you calculate data on the fly, you eliminate the DB data handling that pre-generated data requires.
On the other hand, it is slow. That is why we do it using parallel computing: SIMD on the CPU and CUDA on NVIDIA GPUs. The latest NVIDIA cards have a lot of memory, and we tested an approach that holds all the data in memory. The computations are an order of magnitude faster than on the CPU, so this is a feasible approach. The problem is that not all kinds of computations are appropriate for the GPU; some of them are better suited to vectorized SIMD computation on the CPU, although that is not as fast as the CUDA-based version. So we have a hybrid approach: we pre-calculate part of the data using fast parallel SIMD CPU programming and leave the most often used on-the-fly computations to NVIDIA CUDA, and that way we have the best of both worlds. Combining this with an effective hybrid column-based database, we also cover efficient storage optimized for this kind of data.
What you refer to as "sliding window calculations of 5,10,20,40,80,160,240,360 trading days" is I believe what I would term an indicator of these various lengths. It is no problem to calculate multiple indicator lengths on the fly bar by bar.
It is not a problem, but when you have more instruments and more history it is not fast enough, and you cannot make a screener that way. If you make a screener that way, it is basically a sequential scan, day by day and instrument by instrument, and it would be slow.
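For comparison, the on-the-fly approach can be sketched as below. It is a minimal illustration, not anybody's actual implementation: the `RollingMeans` class, the three window lengths and the two synthetic instruments are assumptions. It shows both that several indicator lengths are easy to compute bar by bar and that every scan has to walk every instrument and every bar again.

```python
from collections import deque

WINDOWS = (5, 10, 20)  # hypothetical; the thread mentions lengths up to 360 days


class RollingMeans:
    """Keeps one running sum per window, so each bar costs O(len(windows))."""

    def __init__(self, windows):
        self.windows = windows
        self.buf = deque(maxlen=max(windows))
        self.sums = {w: 0.0 for w in windows}

    def update(self, value):
        """Feed one bar; return {window: mean, or None if too few bars yet}."""
        for w in self.windows:
            self.sums[w] += value
            if len(self.buf) >= w:
                self.sums[w] -= self.buf[-w]  # value leaving this window
        self.buf.append(value)
        n = len(self.buf)
        return {w: (self.sums[w] / w if n >= w else None) for w in self.windows}


def scan_on_the_fly(universe, window):
    """Sequential scan: instrument by instrument, bar by bar."""
    hits = []
    for symbol, closes in universe.items():
        state = RollingMeans(WINDOWS)  # indicators are rebuilt on every scan
        for day, close in enumerate(closes):
            means = state.update(close)
            if means[window] is not None and close > means[window]:
                hits.append((symbol, day))
    return hits


universe = {
    "AAA": [float(x) for x in range(1, 26)],      # synthetic rising series
    "BBB": [float(x) for x in range(25, 0, -1)],  # synthetic falling series
}
hits = scan_on_the_fly(universe, window=5)
```

The work per scan grows with instruments times bars times window lengths, and nothing is kept between scans, which is the cost I am referring to when the universe and the history get large.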

...but I'm still interested in reading about it.