pandas too slow for event driven backtesting

Hi, I'm looking for suggestions on how to store minute-resolution data in memory for better backtesting performance. I am writing my own event-based backtesting framework in Python, just for fun. The data is read into memory in full and stored as a pandas DataFrame, which is then indexed with .loc[] inside the main backtest loop.

The DataFrame has a two-level MultiIndex: a datetime level and a symbol level. On every time step, I have to perform an indexing operation to get the historical 1-minute data for certain stocks over certain time periods (the time periods are monotonically increasing).

My main issue is that the pandas indexing/slicing operation is far too slow, and I know there is no perfect remedy for it.

So I wonder whether there are any suggested alternatives for this kind of use; maybe I should not store my minute data as a pandas DataFrame at all. Is a numpy ndarray or a structured array a better choice? But then there would be no easy solution for time series indexing.
 
So you're looping with .loc? That seems like a terrible idea. At least try to use itertuples if vectorizing is out of the question. What I did was use vectorized solutions (numpy) where possible, switch to numba elsewhere, and within the numba-compiled functions use numpy where possible and sensible.
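To make the itertuples vs. vectorized point concrete, here is a minimal sketch. The `bars` frame and its `close` column are made up for illustration and are not taken from the original poster's data.

```python
import pandas as pd

# Hypothetical 1-minute bars for a single symbol (illustrative values only).
bars = pd.DataFrame(
    {"close": [10.0, 10.5, 10.2, 10.8]},
    index=pd.date_range("2024-01-02 09:30", periods=4, freq="min"),
)

# Event-style loop: itertuples yields one lightweight namedtuple per row,
# which avoids a label lookup with .loc on every iteration.
loop_total = 0.0
for row in bars.itertuples():
    # row.Index is the timestamp, row.close is the column value.
    loop_total += row.close

# Vectorized equivalent: pull the column out as a plain numpy array once.
closes = bars["close"].to_numpy()
vector_total = float(closes.sum())
```

Both totals agree up to floating-point rounding; the vectorized form avoids the Python-level loop entirely, which is where most of the time goes on larger frames.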
 
Yes, I think a numpy ndarray is the right direction.
I have timed numpy and pandas indexing. With numpy, indexing takes on the order of 100 nanoseconds to 1 microsecond, but with pandas the label-based .loc[] can take anywhere from 100 microseconds to 10 milliseconds. So numpy indexing is about 1000 times faster than pandas indexing!!
But I am still not sure how to switch to a numpy ndarray, as it only supports integer, position-based indexing. If I want label-based indexing, such as selecting certain time periods and certain assets the way I did in pandas with .loc[], I am not sure how to achieve that in numpy.
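One possible way to get label-style lookups on top of plain numpy, sketched under some assumptions: the data is laid out as one 2-D array per field (rows are minutes, columns are symbols), the timestamps are kept in a separate sorted datetime64 array, and a dict maps each symbol to its column. The names `times`, `col_of`, `close` and `window` are hypothetical, and the prices are random placeholders.

```python
import numpy as np

# Sorted minute timestamps: 09:30 through 09:39 (assumed layout, one row per minute).
times = np.arange("2024-01-02T09:30", "2024-01-02T09:40", dtype="datetime64[m]")
symbols = ["AAPL", "MSFT", "GOOG"]
col_of = {sym: i for i, sym in enumerate(symbols)}  # symbol label -> column position

# Placeholder close prices, shape (minutes, symbols).
rng = np.random.default_rng(0)
close = rng.normal(100.0, 1.0, size=(len(times), len(symbols)))

def window(start, end, syms):
    """Close prices with start <= time <= end for the given symbols.

    Both ends are inclusive, mirroring how .loc slices labels.
    """
    # Binary search on the sorted timestamps turns labels into row positions.
    lo = np.searchsorted(times, np.datetime64(start, "m"), side="left")
    hi = np.searchsorted(times, np.datetime64(end, "m"), side="right")
    cols = [col_of[s] for s in syms]
    return close[lo:hi, cols]

w = window("2024-01-02T09:32", "2024-01-02T09:35", ["MSFT", "AAPL"])
```

Since the requested time periods only move forward during the backtest, the `searchsorted` call could probably be replaced by a running row cursor that is simply advanced on each event, which would skip the search altogether.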
 
Not everyone knows how to write C/C++. Python is not the most efficient option, but it is convenient.
Also, I will never understand why people think that Python is easier than C++ or C#.
The amount of effort you need to get things done in Python would be enough to learn any language.
 