Vectorized backtesting with pandas

nooby_mcnoob · Apr 30, 2019

Just wanted to submit this since I've found it useful for my own work.

TL;DR: compute returns at signal time for various holding periods and be satisfied that it's 80% accurate

One of the issues with backtesting is that it can be time-consuming. Backtesting is never 100% accurate, nor should you expect it to be: if you think yours is, you're probably wrong.

I usually aim for 80-90% accuracy in anything I try because the remaining 10-20% will likely not give me any huge returns. If you want 100% accuracy, this will likely be useless.

In order to backtest a strategy, you have to decide what you want out of it. For me, this is:

Is this strategy worth looking into further?

I don't want to wait for 10 minutes to backtest against 15 years of data, so quick turnaround is important. 15 seconds is my cutoff.

This means event-based backtesters are probably out of the question since they need to go through each bar one at a time. We need something that can work faster.

Vectorization is (basically) performing multiple operations at the same time, usually across a whole array. If you have an N-element array and want to multiply it by 3.5, there are two ways to go about it:

1. Loop through each element and multiply it by 3.5
2. Multiply each element by 3.5 at the same time

The latter needs hardware support (SIMD) or, at the very least, a loop that runs in native code instead of in the Python interpreter.

With pandas, arithmetic on a Series or DataFrame is applied to the whole underlying array in one call, and most of the basic operations work this way. So the above operation (A*3.5) runs in NumPy's compiled code, not as a Python-level loop. This is what makes pandas faster than doing the same thing with plain Python lists.
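To make the two approaches concrete, here is a minimal sketch on a made-up five-element Series (the name `values` and the data are just for illustration). Both give the same numbers; the difference is where the loop runs.

```python
import numpy as np
import pandas as pd

# Hypothetical data: any numeric Series behaves the same way.
values = pd.Series(np.arange(5, dtype=float))

# 1. Explicit loop: one element at a time, in the Python interpreter.
looped = pd.Series([v * 3.5 for v in values])

# 2. Vectorized: one expression, the loop runs inside NumPy's compiled code.
vectorized = values * 3.5

# Same result either way.
same = looped.equals(vectorized)
```

On a handful of rows you won't notice a difference; it shows up once you have years of bars.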

Now, in order to backtest a strategy, you need to know your entries and subsequent returns. Exits can be done this way as well, but I haven't bothered with it yet.

My process is something like this:

1. Identify positioning (position size can be done, haven't bothered): -1, np.nan, 1 as short, none, long

Code:

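import numpy as np
import pandas as pd

bars = pd.DataFrame(....)  # needs ema15 and ema30 columns already computed

long = bars.ema15 > bars.ema30 # or whatever
short = bars.ema15 < bars.ema30
bars['signal'] = np.nan          # default: no position
bars.loc[long,'signal'] = 1      # long
bars.loc[short,'signal'] = -1    # short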

2. Identify subsequent returns

Code:

# want to look at returns after holding for 1..N periods (set N, the max holding period, beforehand)
for i in range(1,N+1):
    # return after holding for i periods, signed by the position so shorts gain when price falls
    # Note the negative shift: that looks into the future. OMG.
    bars[f'return_{i}'] = bars.signal*(bars.close.shift(-i) - bars.close)/bars.close

3. ???

4. ???

5. Profit? Maybe? Probably not.
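Here is a rough end-to-end sketch of steps 1 and 2 that you can paste and run. Everything data-related is an assumption for illustration: the closes are a synthetic random walk, the EMAs come from `ewm(span=...)`, and `N = 5` is an arbitrary choice. Note that the signal here is a position on every bar, not just on the crossover bar, so the final summary averages over every bar that carries a position.

```python
import numpy as np
import pandas as pd

# Synthetic closes (a random walk), purely for illustration.
rng = np.random.default_rng(0)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=500)))
bars = pd.DataFrame({"close": close})

# Step 1: EMA crossover positioning (1 = long, -1 = short, NaN = none).
bars["ema15"] = bars["close"].ewm(span=15, adjust=False).mean()
bars["ema30"] = bars["close"].ewm(span=30, adjust=False).mean()
bars["signal"] = np.nan
bars.loc[bars["ema15"] > bars["ema30"], "signal"] = 1
bars.loc[bars["ema15"] < bars["ema30"], "signal"] = -1

# Step 2: signed forward return for each holding period.
N = 5  # assumed maximum holding period, in bars
for i in range(1, N + 1):
    future_close = bars["close"].shift(-i)  # looks i bars into the future
    bars[f"return_{i}"] = bars["signal"] * (future_close - bars["close"]) / bars["close"]

# Mean signed return per holding period (NaN rows are skipped by mean()).
summary = bars[[f"return_{i}" for i in range(1, N + 1)]].mean()
print(summary)
```

The last i rows of each `return_{i}` column are NaN because there is no future close to compare against.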

I often identify any columns looking into the future with a 'return_' prefix or a 'future_' prefix so I don't accidentally use them anywhere else.
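The prefix convention also makes it easy to filter those columns out mechanically. A small sketch; the helper name `split_future_columns` and the toy frame are made up for illustration:

```python
import pandas as pd

# Prefixes that mark a column as looking into the future.
FUTURE_PREFIXES = ("return_", "future_")

def split_future_columns(df):
    """Return (safe, future) column lists based on the naming convention."""
    future = [c for c in df.columns if str(c).startswith(FUTURE_PREFIXES)]
    safe = [c for c in df.columns if c not in future]
    return safe, future

# Toy frame, hypothetical column names.
bars = pd.DataFrame({
    "close": [1.0, 2.0],
    "signal": [1, -1],
    "return_1": [1.0, None],
    "future_high": [2.0, None],
})
safe, future = split_future_columns(bars)
```

Anything that builds signals would then only ever see `bars[safe]`.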

Hope this is useful to someone, would love to hear any criticisms.

jharmon · Apr 30, 2019

nooby_mcnoob said:
I don't want to wait for 10 minutes to backtest against 15 years of data, so quick turnaround is important. 15 seconds is my cutoff.

Fail. It doesn't matter how long your backtest takes - the only thing that matters is that you can calculate an entry signal before you need to pull the trigger when your system is live.

nooby_mcnoob · Apr 30, 2019

jharmon said:
Fail. It doesn't matter how long your backtest takes - the only thing that matters is that you can calculate an entry signal before you need to pull the trigger when your system is live.

Is that even relevant to what I'm talking about? How?

d08 · Apr 30, 2019

Doing things slower can sometimes be beneficial; you won't win on speed anyway. It gives you time to think, so you plan testing more carefully. That said, I use pandas as well, not always vectorized.

nooby_mcnoob · Apr 30, 2019

d08 said:
Doing things slower can sometimes be beneficial; you won't win on speed anyway. It gives you time to think, so you plan testing more carefully. That said, I use pandas as well, not always vectorized.

You're not wrong, but most of my time is spent thinking. I hate transcribing the ideas and then having to wait an hour for results.

I also have code that does a more thorough backtest, but it takes a long time. I only reach for it if I think I've found something.

nooby_mcnoob · Apr 30, 2019

d08 said:
Doing things slower can sometimes be beneficial; you won't win on speed anyway. It gives you time to think, so you plan testing more carefully. That said, I use pandas as well, not always vectorized.

What do you use pandas for? Research? Trading? Analysis? All?

d08 · Apr 30, 2019

Everything; the non-core functions are also useful. I like its efficiency, but I'm not always smart enough to take full advantage of it.

nooby_mcnoob · Apr 30, 2019

d08 said:
Everything; the non-core functions are also useful. I like its efficiency, but I'm not always smart enough to take full advantage of it.

It takes practice. I find Jupyter invaluable for quick one-offs.

Metamega · Apr 30, 2019

Depending on the data used, I can't see the benefit of event-driven vs vectorized.

If you're using minute data and assuming values besides acting on open and close values, you're kind of guessing entries/exits to an extent.

Unless you use tick data, there's no difference.

Using EOD data and vectorizing and using open/close or a cross of a previous bar, for instance, you should get the same results.

nooby_mcnoob · Apr 30, 2019

Metamega said:
Depending on the data used, I can't see the benefit of event-driven vs vectorized.

If you're using minute data and assuming values besides acting on open and close values, you're kind of guessing entries/exits to an extent.

Unless you use tick data, there's no difference.

Using EOD data and vectorizing and using open/close or a cross of a previous bar, for instance, you should get the same results.

Event-based backtesting lets you do things you can't easily do with vectorized tests. For example, building up state. Conceptually, it is also easier for other people to understand.
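A toy example of what I mean by state, not my actual backtester: a trailing stop needs to remember the entry price and the highest price seen since entry, which is natural in a bar-by-bar loop and awkward to express as column arithmetic. The function name, the 2% stop, and the prices are all made up for illustration.

```python
def run_event_backtest(closes, signals, stop_pct=0.02):
    """Bar-by-bar loop: enter long on a signal of 1, exit on a trailing stop."""
    in_position = False
    entry = peak = 0.0
    trades = []  # realized return of each closed trade
    for price, signal in zip(closes, signals):
        if in_position:
            # State carried across bars: highest close since entry.
            peak = max(peak, price)
            if price <= peak * (1 - stop_pct):
                trades.append((price - entry) / entry)
                in_position = False
        elif signal == 1:
            in_position = True
            entry = peak = price
    return trades

# Hypothetical closes with a long signal on every bar.
closes = [100.0, 102.0, 104.0, 101.0, 103.0, 99.0]
signals = [1, 1, 1, 1, 1, 1]
trades = run_event_backtest(closes, signals)
```

Here the first trade is stopped out at 101 after the peak of 104, and the loop re-enters on the next bar. Whether a bar enters depends on what happened on earlier bars, which is exactly the part that is hard to vectorize.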