What do you do with your missing data?

nijshar28 · Jun 25, 2020

I wonder how everyone handles missing data during strategy development.

Here I mostly mean EOD close data that is missing because the company is delisted, not yet listed, or simply not trading for a period of time.

I have seen some quant researchers recommend forward filling the missing data first, then backfilling.

However, to me it makes more sense to fill the missing price data with 0s. This way I can tell my algorithm to exclude those 0-priced tickers from the trading universe. So, if I am not holding such tickers, I would not be able to open positions in them. And if I have positions open, I would not be able to close or alter them. I would also incur a 100% loss on those positions (100% gain on shorts) that is either temporary or permanent, depending on whether it is a halt or a delisting. Does this approach make sense?

Is there a reason why forward-filling / back-filling is preferable to the 0-fill? Is there another way to handle missing data I am not aware of?

Thank you.

stepandfetchit · Jun 25, 2020

IMO: The proper "handling" of bad and missing data will evolve over time. My first approach was to modify or correct the data so that it would be appropriate for forward testing. This was painful and, for some historical cases, impossible to resolve. I am now moving toward processing the deck that was dealt, which requires a different mindset (for me) but seems to be more productive, with less effort spent addressing hypothetical cases. I handle data in 60-second intervals for options on only a few underlyings (mostly indexes), so I may have different issues than you.

globalarbtrader · Jun 25, 2020

Don't use zeros! Use NaN or None or something like that. You never know whether the price is actually zero (cf. crude oil futures...). Forward filling can make sense, but it should be done as late as possible, for the reasons you've described. E.g., once you've calculated a signal using a price series with missing elements, forward fill the signal.

GAT
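
A minimal pandas sketch of the "fill as late as possible" idea above. The price series and the toy signal (sign of the one-day change) are hypothetical and only illustrate the ordering: compute on the raw series with NaNs, then forward fill the signal rather than the price.

```python
import numpy as np
import pandas as pd

# Hypothetical daily closes; NaN marks days with no price (e.g. a halt).
prices = pd.Series(
    [10.0, 10.5, np.nan, np.nan, 11.0, 11.2],
    index=pd.date_range("2020-06-01", periods=6, freq="D"),
)

# Toy signal (an assumption, not from the thread): sign of the one-day change.
# Any change that touches a missing price is itself NaN.
raw_signal = np.sign(prices.diff())

# Forward fill the signal only at the end, so the last known view carries
# over the gap without inventing prices.
signal = raw_signal.ffill()

print(pd.DataFrame({"price": prices, "raw": raw_signal, "signal": signal}))
```

Had the price been forward filled first, the one-day change on the gap days would be exactly zero, and the toy signal would read "flat" instead of carrying the last real value.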

nijshar28 · Jun 25, 2020

globalarbtrader said:
Don't use zeros! Use NaN or None or something like that. You never know whether the price is actually zero (cf. crude oil futures...). Forward filling can make sense, but it should be done as late as possible, for the reasons you've described. E.g., once you've calculated a signal using a price series with missing elements, forward fill the signal.

GAT

Hey. Thanks for your input. I only trade stocks. Is your concern about not knowing the true price relevant only to derivatives?

My issue with forward-filling is that my algo might be trying to trade something which is not tradable at the moment.

globalarbtrader · Jun 25, 2020

nijshar28 said:
Hey. Thanks for your input. I only trade stocks. Is your concern about not knowing the true price relevant only to derivatives?

My issue with forward-filling is that my algo might be trying to trade something which is not tradable at the moment.

I'd still avoid using 0. Also see https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants. Ideally, if you can, I think it's cleanest to mark something as 'no price', 'untradeable', or 'bust'. A missing price could be for any of these reasons, and you should treat them separately.

GAT
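
One way to sketch the "mark the reason, don't use a magic number" idea in Python. The `PriceStatus` enum, the `mark_price` and `can_trade` helpers, and the rules inside them (value a bust at zero, carry the last price through a gap or halt, trade only on real prices) are all hypothetical assumptions for illustration, not a prescribed method.

```python
from enum import Enum

import numpy as np
import pandas as pd


class PriceStatus(Enum):
    OK = "ok"
    NO_PRICE = "no price"        # data gap
    UNTRADEABLE = "untradeable"  # halted or not yet listed
    BUST = "bust"                # delisted with a total loss


dates = pd.date_range("2020-06-01", periods=4, freq="D")
prices = pd.Series([5.0, np.nan, np.nan, np.nan], index=dates)
status = pd.Series(
    [PriceStatus.OK, PriceStatus.NO_PRICE, PriceStatus.UNTRADEABLE, PriceStatus.BUST],
    index=dates,
)


def mark_price(price, state, last_price):
    """Price used to value an existing position, chosen per status."""
    if state is PriceStatus.OK:
        return price
    if state is PriceStatus.BUST:
        return 0.0      # assumption: nothing is recovered
    return last_price   # gap or halt: carry the last known price


def can_trade(state):
    """Assumption: orders are allowed only when there is a real price."""
    return state is PriceStatus.OK


marks = []
last = np.nan
for price, state in zip(prices, status):
    marks.append(mark_price(price, state, last))
    if state is PriceStatus.OK:
        last = price

marks = pd.Series(marks, index=dates)
tradeable = status.map(can_trade)
```

The raw `prices` series keeps its NaNs, and the zero appears only where the status explicitly says the holding is worthless.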

nijshar28 · Jun 25, 2020

globalarbtrader said:
I'd still avoid using 0. Also see https://en.wikipedia.org/wiki/Magic_number_(programming)#Unnamed_numerical_constants. Ideally, if you can, I think it's cleanest to mark something as 'no price', 'untradeable', or 'bust'. A missing price could be for any of these reasons, and you should treat them separately.

GAT

I see your point.

I cannot calculate my signal with NaNs present though. So that unfortunately is a non-starter.

The 0 price fill has the advantage of accurately reflecting losses on delisted holdings, which I think is the most frequent reason for the data to go missing. Sometimes it is an acquisition though, in which case forward-fill makes the most sense.

Maybe a better way would be to monitor volume and exclude tickers with zero volume from trading, while forward filling price? But that mishandles delistings.

I guess there isn't a one-size-fits-all solution here.
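
For reference, the volume idea above could be sketched roughly like this in pandas. The series are made-up, and treating a missing volume as zero is an assumption. As noted, this does not by itself distinguish a halt from a delisting.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-06-01", periods=4, freq="D")
close = pd.Series([10.0, np.nan, np.nan, 10.4], index=dates)
volume = pd.Series([1500.0, np.nan, 0.0, 1800.0], index=dates)

# Forward fill the price so the signal code never sees NaN...
close_filled = close.ffill()

# ...but only allow orders on days with real trading activity.
# Assumption: a missing volume is treated the same as zero volume.
tradable = volume.fillna(0) > 0

print(pd.DataFrame({"close": close_filled, "tradable": tradable}))
```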

guru · Jun 25, 2020

I skip trading those dates, but I do create volumes of missing OHLCV data (thousands of years) for Monte Carlo tests.

Snuskpelle · Jun 25, 2020

nijshar28 said:
I see your point.

I cannot calculate my signal with NaNs present though. So that unfortunately is a non-starter.

The 0 price fill has the advantage of accurately reflecting losses on delisted holdings, which I think is the most frequent reason for the data to go missing. Sometimes it is an acquisition though, in which case forward-fill makes the most sense.

Maybe a better way would be to monitor volume and exclude tickers with zero volume from trading, while forward filling price? But that mishandles delistings.

I guess there isn't a one-size-fits-all solution here.

I would recommend keeping the NaNs in there and then doing the filtering/mapping just before passing the data into the consumer function. It's likely that different consumers need different interpretations of missing data. Of course, in very performance-intensive code this might not be tolerable.
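
A rough sketch of what that could look like with pandas, assuming two hypothetical consumers (`tradable_universe` and `valuation_prices`, both invented for illustration) that read the same raw frame but interpret NaN differently:

```python
import numpy as np
import pandas as pd

# Raw data keeps its NaNs; tickers and values are made up.
raw = pd.DataFrame(
    {"AAA": [10.0, np.nan, 11.0], "BBB": [20.0, 21.0, np.nan]},
    index=pd.date_range("2020-06-01", periods=3, freq="D"),
)


def tradable_universe(prices, date):
    """Consumer 1: only tickers with a real price on `date` may be traded."""
    row = prices.loc[date]
    return list(row[row.notna()].index)


def valuation_prices(prices):
    """Consumer 2: value holdings at the last known price."""
    return prices.ffill()


universe = tradable_universe(raw, raw.index[2])
marks = valuation_prices(raw)
```

Neither consumer modifies `raw`, so a third consumer with yet another interpretation can be added later without re-cleaning the data.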

MarkBrown · Jun 25, 2020

i would sell mine but it's missing so...

ph1l · Jun 25, 2020

According to Machine Learning for Trading (a free course at https://www.udacity.com/course/machine-learning-for-trading--ud501):

For missing data between or after existing data, fill forward (assume the last value continues unchanged until the next non-missing value), because it's usually not a good idea to peek into the future.
For missing data before any existing data, fill backward (assume the first data point extends backward unchanged).
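
In pandas, those two rules amount to a forward fill followed by a backward fill. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical closes: missing before the first price, in the middle, and at the end.
prices = pd.Series(
    [np.nan, np.nan, 10.0, np.nan, 12.0, np.nan],
    index=pd.date_range("2020-06-01", periods=6, freq="D"),
)

# Forward fill first (gaps between and after data), then backward fill
# (only the leading gap is left for this step).
filled = prices.ffill().bfill()

print(filled.tolist())
```

The backward fill gives the pre-listing dates a price the ticker never had, so a separate tradability mask is still needed if those dates should be excluded from trading.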