Quote from thstart:
(snip)
I believe we created an innovative way for:
1) Analysis of time series based on
1.1) OHLC data discovering periodicity relationships in one stock or more stocks.
1.1.1) A sliding window calculations of 5,10,20,40,80,160,240,360 trading days. These calculations are based on regression on each of OHLC with a very fast proprietary algorithm. Also there are 5 more calculations based on these - their Total, and several combinations of them. Also a calculations of the difference between the price and these values for each trading day.
1.1.2) A rate of change calculations on each of OHLC and 1.1.1) with a very fast proprietary algorithm.
1.2) Volume data - a rate of change calculations
1.3) Options volume data - a rate of change calculations for 3 groups - Market Makers, Firms, Retail.
1.4) Insider trading info with aggregate total number of shares info rated and divided depending of the type of each participant - Officer, Director, etc.
1.5) Bank information - the ratio between deposits and loans, etc.
1.6) Interest rates - rate of change
1.7) Economic indicators - rate of change and similar to 1.1 computations.
1.8) OHLC relationships in the same day and a day before-
Like (H-L)/O, (H-L)/O1DayAgo and many others - Pivot Points for example.
1-7 are relative long term, 3, 8 - a short term indicators.
I believe they give a good mix of data
influencing the trading decision from different independent point of views.
(snip)
I can't say I am 100% following all that you say. You are saying quite a bit in a short amount of discussion, but with multiple read throughs I think I am catching most of it.
It seems that much of what you are describing is actually an opinion about what types of data you think are useful for creating a stock trading strategy. Not that there is anything wrong with that topic, but there are two separate topics here. First, input data and calculated indicators. Second, backtesting a specific trading strategy that uses the input data and calculated indicators.
Other strategy researchers I am sure will have other ideas about what input data they need as input to their backtesting and scans. The 1,000 columns of data that you discuss is what you are using for your research, but the next strategy researcher will have completely different ideas about what they want as input to backtesting stock strategies and scanning.
Also (in your full original post that I trimmed down quoted above), you discuss a specific analysis strategy, as well as some specific user interface. All of this is a very wide discussion, where the topic I am interested in discussing is specifically the data handling, computation speed, and strategy backtesting (and scanning) capability part of this. That seems to be the main focus of this thread. It is the part of the thread that interests me.
So, into that topic...
It seems you are saying that what you do is to generate large quantities of data (the ~1,000 columns), then you query the data to do scans and backtests. Certainly this is an approach, but I think not the only approach to the task.
Let me describe something that has always helped me with coding the rules of trading strategies. I tell myself: think bar by bar, as if the decisions were being made in real-time actual trading. Sometimes there are decisions to be made at multiple points in the bar: on the open, during the bar, after the close. Write the trading rules code to do its processing in sequential order, as if you were trading the strategy in real time. After all, the only goal of this is the ability to trade a strategy in real time. Any decision that can actually be made in real-time trading can, by definition, also be made at the same point in time in a simulation of trading.
Do this one day at a time. Move forward to the next day. I find that thinking in this way helps me to code complex strategies.
From this, another way to approach your task (as opposed to generating 1,000 columns of data and then querying the data) would be to start with the trading rules of the trading strategy or the criteria for a specific scan. Write code to process only the specific data needed for a specific backtest or scan (or write code to process lots of various data, with user interface options enabling processing of a subset of data for a specific backtest, similar to your '"cabd", 5% Gain 5 Days after D-Day' options). Write code to move through the data forward from first date to last date as if you were actually trading. With this approach, much of your 1,000 columns of data could be dynamically created on the fly, processed, and discarded. Then move on to the next bar. This would greatly reduce the quantity of stored data.
What I am describing is how most backtesting software does it.
I suspect that no single strategy or scan would actually consume 1,000 items of information. I can't see how a single trading strategy could consume that much input data, and you seem to be saying the same in your post when you talk about '"cabd", 5% Gain 5 Days after D-Day'. I interpret this to mean that any single test or scan will focus on a subset of the factors you are listing.
So the alternative approach might be to dynamically calculate only the data desired for the specific backtest or scan, bar by bar, as I describe above (some of the fundamental data you mention would still come from disk). Move forward, calculate much of this input data on the fly, consume the calculated data, discard calculated data you are finished with, move on to the next bar. This is how I have approached this type of task, and it works very well and with very good performance.
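As a rough illustration of the bar-by-bar approach described above, here is a minimal C++ sketch. Everything in it is an assumption made for the example: the Bar struct, the runBacktest function, and the trading rule itself (enter on the close when the close is above its simple moving average, exit on the close when it falls below). It is not the rule set from this thread. The point is only the shape of the loop: one bar at a time, indicator values computed on the fly, old data discarded.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// One daily bar. Field names are illustrative only.
struct Bar {
    double open, high, low, close;
};

struct BacktestResult {
    int trades = 0;         // completed round trips
    double totalPnl = 0.0;  // sum of (exit - entry) per share
};

// Hypothetical rule, for illustration only: go long on the close when the
// close is above its `length`-bar simple moving average, exit on the close
// when it is below.
BacktestResult runBacktest(const std::vector<Bar>& bars, std::size_t length) {
    assert(length > 0);
    BacktestResult result;
    std::deque<double> window;  // holds only the last `length` closes
    double sum = 0.0;
    bool inPosition = false;
    double entryPrice = 0.0;

    // Walk forward from the first date to the last, as in live trading.
    for (const Bar& bar : bars) {
        // Update the indicator with this bar and drop data no longer needed.
        window.push_back(bar.close);
        sum += bar.close;
        if (window.size() > length) {
            sum -= window.front();
            window.pop_front();
        }
        if (window.size() < length) continue;  // still warming up

        const double sma = sum / static_cast<double>(length);

        // Decisions use only information available at this bar's close.
        if (!inPosition && bar.close > sma) {
            inPosition = true;
            entryPrice = bar.close;
        } else if (inPosition && bar.close < sma) {
            inPosition = false;
            result.totalPnl += bar.close - entryPrice;
            ++result.trades;
        }
    }
    return result;
}
```

Memory use here depends on the indicator length, not on the length of the price history, so nothing has to be precomputed and stored as columns.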
What you refer to as "sliding window calculations of 5,10,20,40,80,160,240,360 trading days" is I believe what I would term an indicator of these various lengths. It is no problem to calculate multiple indicator lengths on the fly bar by bar.
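To make that concrete, here is a hedged C++ sketch of one way to maintain several sliding-window regressions at once, bar by bar. The original poster's algorithm is proprietary and not shown in this thread, so this is not it: the RollingRegression and IndicatorBank names, and the choice of an ordinary least-squares line fitted against bar index with running sums, are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Least-squares line over the last `length` values, with x = 0..length-1.
// Running sums make each update O(1) regardless of window length.
class RollingRegression {
public:
    explicit RollingRegression(std::size_t length) : n_(length) {
        assert(length >= 2);
    }

    // Feed the next value (for example, a close).
    void update(double y) {
        if (window_.size() < n_) {
            // Warm-up: the new value takes the next x index.
            sumXY_ += static_cast<double>(window_.size()) * y;
            sumY_ += y;
        } else {
            // Slide: every remaining value's x index drops by one.
            const double oldest = window_.front();
            window_.pop_front();
            sumXY_ += oldest - sumY_ + static_cast<double>(n_ - 1) * y;
            sumY_ += y - oldest;
        }
        window_.push_back(y);
    }

    bool ready() const { return window_.size() == n_; }

    // Fitted value of the regression line at the most recent bar.
    double endValue() const {
        assert(ready());
        const double n = static_cast<double>(n_);
        const double sumX = n * (n - 1.0) / 2.0;
        const double sumXX = (n - 1.0) * n * (2.0 * n - 1.0) / 6.0;
        const double slope =
            (n * sumXY_ - sumX * sumY_) / (n * sumXX - sumX * sumX);
        const double intercept = (sumY_ - slope * sumX) / n;
        return intercept + slope * (n - 1.0);
    }

private:
    std::size_t n_;
    std::deque<double> window_;
    double sumY_ = 0.0;
    double sumXY_ = 0.0;
};

// One indicator per length, all advanced by the same bar.
struct IndicatorBank {
    std::vector<RollingRegression> indicators;

    explicit IndicatorBank(const std::vector<std::size_t>& lengths) {
        for (std::size_t len : lengths) indicators.emplace_back(len);
    }

    void onBar(double value) {
        for (RollingRegression& ind : indicators) ind.update(value);
    }
};
```

An IndicatorBank built with the lengths 5, 10, 20, 40, 80, 160, 240, 360 would keep at most 360 values per input series. One caveat: running sums can accumulate floating-point error over very long series, so a production version might recompute the sums from the window periodically.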
I am coming at this whole topic from a completely different direction than all the other posters. All of this discussion has been about how to construct tables to get performance out of SQL, and about hardware: a high-speed Dell PowerEdge 2850 server with 2 x 3 GHz Xeon CPUs, 4 GB RAM, and RAID 2 x 75 GB HDD; SIMD; NVIDIA CUDA GPU parallel computing with up to 240 processors; SSDs.
This whole thread strikes me as a very detailed explanation, based upon extensive experience, of why SQL is not the right tool for the job of processing portfolios of market data. I have encountered this before when talking to others. SQL seems like a great idea at first because much of the low-level nuts and bolts work is already done for you. But what you point out in detail is what I have heard before: that performance is a huge problem.
I have instead taken the approach of custom data handling hand coded from scratch in C++ specifically for time series market data, and I get excellent performance on this type of processing, even with a consumer level desktop or laptop computer with 1 GB to 3 GB of memory.
What is the flaw in what I say? I am proposing that it is a better approach.
You have briefly mentioned the quantity of market data and various performance times with the approaches you have tried, but I am unclear about the specifics. As a performance benchmark, for example: how many stocks do you run the strategy against? How many years is the time span of the backtest? How long does that backtest take, and on what kind of hardware is it running?
- Bob Bolotin