pandas too slow for event driven backtesting

Marshall · Feb 10, 2022

Hi, I'm looking for suggestions on how to store minute-resolution data in memory for better backtesting performance. I am writing my own event-based backtesting framework in Python, just for fun. The data is read into memory as a whole and stored as a pandas DataFrame, which is then indexed with .loc[] inside the main backtest loop.

The DataFrame has a two-level MultiIndex: a datetime level and a symbol level. On every time step I have to perform an indexing operation to get the historical 1-minute data for certain stocks over certain time periods (the time periods are monotonically increasing).

My main issue is that the pandas indexing/slicing operation is way too slow, and I know there is no perfect remedy for this.

So I wonder if there are any suggested alternatives for this kind of purpose. Maybe I should not store my minute data as a pandas DataFrame. Is a numpy ndarray or structured array a better choice? But then there would be no easy solution for time series indexing.

R1234 · Feb 10, 2022

looks like you might gain speed by vectorizing your code if at all possible.
I avoid loops in python.

PoundTheRock · Feb 10, 2022

https://numpy.org/doc/stable/reference/arrays.datetime.html
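To make the link concrete, here is a minimal sketch of how a sorted datetime64 array could stand in for the datetime index level. It assumes the data fits a dense (time, symbol) array; the names `times`, `col`, `close` and `window` and the sample values are made up for illustration. `np.searchsorted` turns the time labels into integer positions, and a plain dict maps symbols to columns.

```python
import numpy as np

# Hypothetical data: five 1-minute bars for three symbols, stored as a
# dense (time, symbol) array of close prices.
times = np.arange("2022-02-10T09:30", "2022-02-10T09:35", dtype="datetime64[m]")
symbols = ["AAPL", "MSFT", "GOOG"]
col = {s: i for i, s in enumerate(symbols)}  # symbol label -> column position
close = np.arange(15, dtype=np.float64).reshape(5, 3)


def window(start, end, syms):
    """Return the rows with start <= time <= end for the given symbols."""
    # times must be sorted; binary search maps labels to integer positions.
    lo = np.searchsorted(times, np.datetime64(start), side="left")
    hi = np.searchsorted(times, np.datetime64(end), side="right")
    cols = [col[s] for s in syms]
    return close[lo:hi][:, cols]


out = window("2022-02-10T09:31", "2022-02-10T09:33", ["AAPL", "GOOG"])
print(out)
```

This only works if every symbol has a bar for every timestamp (or the gaps are filled with NaN), which is an assumption about the data, not something stated in the thread.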

Zwaen · Feb 10, 2022

Aren’t tibbles the way to go (instead of dataframes)?

d08 · Feb 10, 2022

So you're looping with .loc? That seems like a terrible idea. At least try to use itertuples if vectorizing is out of the question. What I did was to use vectorized solutions (numpy) where possible and elsewhere switch to numba and within numba compiled functions use numpy where possible and sensible.
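A rough sketch of the "vectorize where possible" part, without numba: since the time periods only move forward, all window boundaries can be computed in one numpy call before the loop, so the loop itself only does integer slicing. The data, the 3-minute lookback and the moving-average "strategy" are placeholders, not anything from the original framework.

```python
import numpy as np

# Hypothetical single-symbol series with a gap between 09:32 and 09:35.
times = np.array(
    ["2022-02-10T09:30", "2022-02-10T09:31", "2022-02-10T09:32",
     "2022-02-10T09:35", "2022-02-10T09:36"],
    dtype="datetime64[m]",
)
close = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
lookback = np.timedelta64(3, "m")

# One vectorized call: for each bar time t, the position of the first bar
# strictly after t - lookback. The window for bar i is close[starts[i]:ends[i]].
starts = np.searchsorted(times, times - lookback, side="right")
ends = np.arange(1, len(times) + 1)

# The event loop now only slices by integer position, no label lookups.
means = []
for lo, hi in zip(starts, ends):
    history = close[lo:hi]  # a view, no copy
    means.append(float(history.mean()))

print(means)
```

The loop body is the kind of code that could later be moved into a numba-compiled function, since it only touches plain arrays and integers.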

Marshall · Feb 10, 2022

d08 said:
So you're looping with .loc? That seems like a terrible idea. At least try to use itertuples if vectorizing is out of the question. What I did was to use vectorized solutions (numpy) where possible and elsewhere switch to numba and within numba compiled functions use numpy where possible and sensible.

Yes, I think a numpy ndarray is the right direction.
I have timed numpy and pandas indexing. Numpy indexing takes on the order of 100 nanoseconds to 1 microsecond, while the labeled indexing method .loc[] in pandas can take from 100 microseconds up to 10 milliseconds. That makes numpy indexing about 1000 times faster than pandas indexing!!
But I am still not sure how to switch to a numpy ndarray, as it only supports integer-location-based indexing. If I want to perform label-based indexing, like selecting certain time periods and certain assets as I did in pandas with .loc[], I am not sure how to achieve that in numpy.
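For anyone who wants to reproduce this kind of comparison, a rough sketch of the timing setup follows. The frame size, symbol names and the single "close" column are made up, and the absolute numbers will depend on the machine and the pandas/numpy versions, so no expected timings are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical frame: 390 one-minute bars x 3 symbols, indexed by (datetime, symbol).
times = pd.date_range("2022-02-10 09:30", periods=390, freq="min")
symbols = ["AAA", "BBB", "CCC"]
index = pd.MultiIndex.from_product([times, symbols], names=["datetime", "symbol"])
df = pd.DataFrame({"close": np.arange(len(index), dtype=float)}, index=index)

# The same data as a dense (time, symbol) array.
arr = df["close"].to_numpy().reshape(len(times), len(symbols))

t0, t1 = times[100], times[109]


def pandas_lookup():
    # Label-based: ten bars of symbol "BBB" between t0 and t1 (inclusive).
    return df.loc[(slice(t0, t1), "BBB"), "close"]


def numpy_lookup():
    # Position-based: the same ten rows, column 1.
    return arr[100:110, 1]


n = 200
t_pandas = timeit.timeit(pandas_lookup, number=n) / n
t_numpy = timeit.timeit(numpy_lookup, number=n) / n
print(f"pandas .loc: {t_pandas * 1e6:.1f} us, numpy slice: {t_numpy * 1e6:.2f} us")
```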

Marshall · Feb 10, 2022

PoundTheRock said:
https://numpy.org/doc/stable/reference/arrays.datetime.html

I know about the datetime64 data type in numpy, but it does not help here. I am not trying to perform arithmetic on datetimes, just trying to find a better way to get time series data during backtesting.

guest_trader_1 · Feb 11, 2022

I will never understand why people use Python for trading.

Marshall · Feb 11, 2022

not everyone knows how to write c/c++. python is not the most efficient way, but it is convenient.

guest_trader_1 · Feb 11, 2022

Marshall said:
not everyone knows how to write c/c++. python is not the most efficient way, but it is convenient.

Also, I will never understand why people think that Python is easier than c++ or c#.
The amount of effort that you need to get things done in Python would allow you to learn any language.