Vectorized backtesting with pandas

jharmon · May 1, 2019

IAS_LLC said:
If speed is your biggest concern....

nooby_mcnoob said:
Not so far... But good to know about it, thanks!

Seems to be a big concern of yours given you want to backtest 15 years of data in 15 seconds - your first post.

15 seconds per security is too long. 15 seconds for several thousand securities is difficult without some serious thought to parallel processing.

Hard to figure out what you are really wanting.

nooby_mcnoob · May 1, 2019

jharmon said:
Seems to be a big concern of yours given you want to backtest 15 years of data in 15 seconds - your first post.

Hard to figure out what you are really wanting.

Well clearly what I did was good enough for me. Haven't needed to go beyond. Hard to know if I need to spell out everything for everyone

jharmon · May 1, 2019

Err OK - still not sure what your universe is.

nooby_mcnoob · May 1, 2019

Why is there always someone litigating words on these threads

jharmon · May 1, 2019

nooby_mcnoob said:
Why is there always someone litigating words on these threads

Your blanket statement about backtesting taking a maximum of 15 seconds started it dude.

Explain yourself - what are you trading? 1 stock/security? 3000? All listed stocks? A handful of futures contracts? Clearly if you are talking performance you need to give some further metrics on your universe.

nooby_mcnoob · May 1, 2019

jharmon said:
Your blanket statement about backtesting taking a maximum of 15 seconds started it dude.

Explain yourself - what are you trading? 1 stock/security? 3000? All listed stocks? A handful of futures contracts? Clearly if you are talking performance you need to give some further metrics on your universe.

I started this thread not to talk about what I need, but how I solved a technical problem that may be useful to others. It is a tool in your belt and not a holy grail. Why you keep going on your pet tangent is probably something to do with you.

PoundTheRock · May 5, 2019

Most of these answers are crap. Either use the multiprocessing package (groupby vectorization for multiple cores) or Dask.

nooby_mcnoob · May 5, 2019

PoundTheRock said:
Most of these answers are crap. Either use the multiprocessing package (groupby vectorization for multiple cores) or Dask.

Irony of ironies I started doing this yesterday after claiming I didn't need it. Thanks ET for continually calling me out.

GRULSTMRNN · May 9, 2019

May I recommend you go back to the drawing board and rethink your entire premise. You want to vectorize a backtest. Vectorization only works when a process is not path dependent. So, you can vectorize a backtest over multiple symbols, given that the performance of each symbol is independent of the performance of the other symbols and that there are no other dependencies. So, you iterate over your data set and in parallel feed data points to each individual backtest, each of which covers one symbol. You cannot vectorize the data feed itself, because you then make the assumption that performance tomorrow is independent of performance today, which is inherently wrong in financial trading. Imagine you enter into a position today: tomorrow your backtest needs to know that you are in a position, else your entire backtest will be flawed.

Long story short, you cannot vectorize a path-dependent process.
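To make the path-dependence point concrete, here is a minimal sketch, assuming a hypothetical rule that is not from the thread: go long on an entry flag and exit when price falls 5% below the entry price. The function name `backtest_with_stop` and the `stop_pct` parameter are made up for illustration. The exit on any bar depends on the entry price recorded on an earlier bar, which is the kind of state a bar-by-bar loop carries forward.

```python
import pandas as pd


def backtest_with_stop(close: pd.Series, entry: pd.Series, stop_pct: float = 0.05) -> pd.Series:
    """Return 1 while long and 0 while flat, walking the bars in order.

    Hypothetical rule for illustration: enter on an entry flag, exit when
    price drops stop_pct below the price at which the position was opened.
    """
    position = []
    in_position = False
    entry_price = 0.0
    for price, wants_entry in zip(close, entry):
        if in_position and price <= entry_price * (1 - stop_pct):
            # Stop hit: this decision depends on the earlier entry price.
            in_position = False
        elif not in_position and wants_entry:
            in_position = True
            entry_price = price
        position.append(int(in_position))
    return pd.Series(position, index=close.index)


close = pd.Series([100.0, 102.0, 99.0, 96.0, 97.0, 101.0])
entry = pd.Series([False, True, False, False, False, True])
position = backtest_with_stop(close, entry)
print(position.tolist())
```

Whether bar 3 is an exit cannot be decided from bar 3 alone; it needs the entry price from bar 1, so the state has to be tracked across bars.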

nooby_mcnoob said:
Just wanted to submit this since I've found it useful for my own work.

TL;DR: compute returns at signal time for various holding periods and be satisfied that it's 80% accurate

One of the issues with backtesting is that it can be time consuming. Backtesting is never 100% accurate, nor should you seek it to be 100% accurate because then you're probably wrong.

I usually aim for 80-90% accuracy in anything I try because the remaining 20% will likely not give me any huge returns. If you want 100% accuracy, this will likely be useless.

In order to backtest a strategy, you have to decide what you want out of it. For me, this is:

I don't want to wait for 10 minutes to backtest against 15 years of data, so quick turnaround is important. 15 seconds is my cutoff.

This means event-based backtesters are probably out of the question since they need to go through each bar one at a time. We need something that can work faster.

Vectorization is (basically) performing multiple operations at the same time, usually across a whole array. If you have an N-element array and want to multiply it by 3.5, there are two ways to go about it:

1. Loop through each element and multiply it by 3.5
2. Multiply each element by 3.5 at the same time

The latter requires hardware support or at the very least, can be done in native code.

With Pandas, operations are vectorized when they can be broadcast. Fortunately, many operations in Pandas are conducive to broadcasting. This includes the basic arithmetic operations. So the above operation (A*3.5) is more or less guaranteed to run in compiled native code rather than in the Python interpreter. This is what makes Pandas faster than doing the same thing with Python lists.
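The two approaches can be put side by side in a small sketch. The Series `values` is made-up sample data; both versions give the same result, but the second hands the whole array to compiled code in one call.

```python
import numpy as np
import pandas as pd

values = pd.Series(np.arange(5, dtype="float64"))

# 1. Explicit Python loop: one multiplication per interpreter step.
looped = pd.Series([v * 3.5 for v in values], index=values.index)

# 2. Vectorized: the scalar 3.5 is broadcast across the whole Series.
vectorized = values * 3.5

# Same numbers either way; only the execution path differs.
assert looped.equals(vectorized)
```

On a five-element Series the difference is invisible; it only starts to matter on arrays with many thousands of rows.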

Now, in order to backtest a strategy, you need to know your entries and subsequent returns. Exits can be done this way as well, but I haven't bothered with it yet.

My process is something like this:

1. Identify positioning (position size can be done, haven't bothered): -1, np.nan, 1 as short, none, long

Code:

bars = pd.DataFrame(....)
long = bars.ema15 > bars.ema30  # or whatever
short = bars.ema15 < bars.ema30
bars['signal'] = np.nan
bars.loc[long,'signal'] = 1
bars.loc[short,'signal'] = -1

2. Identify subsequent returns

Code:

# want to look at returns after holding for N periods
for i in range(1,N+1):
    # return after holding for i periods
    # Note the negative shift: that looks into the future. OMG.
    bars[f'return_{i}'] = bars.signal*(bars.shift(-i).close - bars.close)/bars.close

3. ???

4. ???

5. Profit? Maybe? Probably not.

I often identify any columns looking into the future with a 'return_' prefix or a 'future_' prefix so I don't accidentally use them anywhere else.
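For anyone who wants to run the two steps end to end, here is a minimal sketch on synthetic data. The random-walk prices, the EMA spans of 15 and 30 (guessed from the column names), N = 5, and the names `future_cols`, `summary`, and `features` are all assumptions for illustration, not part of the original post. It also shows one way the prefix convention can be used to keep forward-looking columns out of anything else.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk closes standing in for real bars (assumption).
rng = np.random.default_rng(0)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500)))
bars = pd.DataFrame({"close": close})

# EMA spans guessed from the column names ema15 / ema30.
bars["ema15"] = bars["close"].ewm(span=15, adjust=False).mean()
bars["ema30"] = bars["close"].ewm(span=30, adjust=False).mean()

# Step 1: positioning as -1 (short), NaN (none), 1 (long).
bars["signal"] = np.nan
bars.loc[bars["ema15"] > bars["ema30"], "signal"] = 1
bars.loc[bars["ema15"] < bars["ema30"], "signal"] = -1

# Step 2: signed return after holding for i periods (looks into the future).
N = 5
for i in range(1, N + 1):
    future_close = bars["close"].shift(-i)
    bars[f"return_{i}"] = bars["signal"] * (future_close - bars["close"]) / bars["close"]

# Use the prefix convention to separate forward-looking columns from the rest.
future_cols = [c for c in bars.columns if c.startswith(("return_", "future_"))]
summary = bars[future_cols].mean()          # mean signed return per holding period
features = bars.drop(columns=future_cols)   # safe to use for signal generation

print(summary)
```

The last i rows of each `return_i` column are NaN because there is no future bar to compare against; `mean()` skips them by default.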

Hope this is useful to someone, would love to hear any criticisms.

GRULSTMRNN · May 9, 2019

What exactly are you vectorizing? As I mentioned in my post above, you can vectorize multiple independent backtests over identical data sets, but you cannot vectorize a single backtest over a data set when that backtest is path dependent. I explained why that is the case in my post above.
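The multi-symbol case can be sketched with a pandas groupby, assuming a long-format frame with a hypothetical `symbol` column and rows already sorted by time within each symbol. The tickers, prices, and the always-long signal are made up. Shifting inside the groupby keeps one symbol's future close from leaking into another symbol's last bar.

```python
import numpy as np
import pandas as pd

# Made-up long-format bars: two symbols, four bars each, time-ordered per symbol.
bars = pd.DataFrame({
    "symbol": ["AAA"] * 4 + ["BBB"] * 4,
    "close": [10.0, 11.0, 12.1, 12.1, 50.0, 49.0, 49.0, 51.0],
})
bars["signal"] = 1.0  # always long, for illustration only

# Shift within each symbol so the last AAA bar does not see the first BBB bar.
future_close = bars.groupby("symbol")["close"].shift(-1)
bars["return_1"] = bars["signal"] * (future_close - bars["close"]) / bars["close"]

print(bars)
```

A plain `bars["close"].shift(-1)` on this frame would compare the last AAA close with the first BBB close, so the per-symbol shift matters as soon as more than one symbol shares a frame.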

R1234 said:
Gotta vectorize. I did an initial backtest in Python with lots of large dataframes using looping. That took almost an hour to complete.

Then I vectorized everything and it ran in 4 minutes.