Cleaning historical data correctly comes down to process. With the right process you will not only succeed, you will also be able to define what success means in the first place. Here is the correct process:
- Clean in short, specific stages
- Report easily accessible data characterization metrics for each stage
- Be able to easily change the order of the stages
- Be able to use the data from each stage in your application (in this case, a backtest), and additionally compare multiple cleaned products derived from the same initial raw data
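
The points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row layout (`ts`, `price`), the three stage functions, and the metrics in `characterize` are hypothetical stand-ins, not the steps any particular cleaning tool uses. The idea it tries to show is that each stage is small, the order is just a list of names, and every intermediate product is kept together with its metrics so it can be fed to a backtest or compared against another ordering.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, Optional[float]]  # hypothetical layout: {"ts": ..., "price": ...}


def drop_missing(rows: List[Row]) -> List[Row]:
    """Stage: remove rows that have no price."""
    return [r for r in rows if r["price"] is not None]


def drop_nonpositive(rows: List[Row]) -> List[Row]:
    """Stage: remove rows with price <= 0 (missing prices are left for drop_missing)."""
    return [r for r in rows if r["price"] is None or r["price"] > 0]


def dedupe_timestamps(rows: List[Row]) -> List[Row]:
    """Stage: keep only the first row seen for each timestamp."""
    seen = set()
    out = []
    for r in rows:
        if r["ts"] not in seen:
            seen.add(r["ts"])
            out.append(r)
    return out


# Registry of available stages; the pipeline order is just a list of these names.
STAGES: Dict[str, Callable[[List[Row]], List[Row]]] = {
    "drop_missing": drop_missing,
    "drop_nonpositive": drop_nonpositive,
    "dedupe_timestamps": dedupe_timestamps,
}


def characterize(rows: List[Row]) -> Dict[str, Optional[float]]:
    """Illustrative characterization metrics for one data product."""
    prices = [r["price"] for r in rows if r["price"] is not None]
    return {
        "rows": len(rows),
        "missing": len(rows) - len(prices),
        "min_price": min(prices) if prices else None,
        "max_price": max(prices) if prices else None,
    }


def run_pipeline(raw: List[Row], order: List[str]) -> List[Tuple[str, List[Row], dict]]:
    """Run the stages in the given order, keeping every intermediate product and its metrics."""
    current = list(raw)
    snapshots = [("raw", current, characterize(current))]
    for name in order:
        current = STAGES[name](current)
        snapshots.append((name, current, characterize(current)))
    return snapshots


RAW: List[Row] = [
    {"ts": 1, "price": 10.0},
    {"ts": 1, "price": 10.0},   # duplicate timestamp
    {"ts": 2, "price": None},   # missing price
    {"ts": 3, "price": -5.0},   # impossible price
    {"ts": 4, "price": 11.0},
]

if __name__ == "__main__":
    for name, _rows, metrics in run_pipeline(
        RAW, ["drop_missing", "drop_nonpositive", "dedupe_timestamps"]
    ):
        print(name, metrics)
```

Changing the order is a matter of passing a different list to `run_pipeline`, and because every snapshot is retained, two cleaned products built from the same raw input can be compared stage by stage.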
Summary:
Good situation-- a configuration input file for your cleaning software that specifies the cleaning steps, the raw data input file, and the output paths for the clean data file and the characterization metrics
Bad situation-- your backtesting software has some clever data-filtering routines buried inside it
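
As a rough illustration of the "good situation", here is what such a configuration file might look like and how cleaning software could read it. Everything specific here is an assumption made for the example: the JSON format, the key names (`raw_input`, `steps`, `clean_output`, `metrics_output`), the file paths, and the step names are all hypothetical.

```python
import json
from dataclasses import dataclass
from typing import List

# Hypothetical config contents; in practice this would live in its own file.
EXAMPLE_CONFIG = """
{
  "raw_input": "data/raw/prices.csv",
  "steps": ["drop_missing", "drop_nonpositive", "dedupe_timestamps"],
  "clean_output": "data/clean/prices_v1.csv",
  "metrics_output": "data/clean/prices_v1_metrics.json"
}
"""

REQUIRED_KEYS = ("raw_input", "steps", "clean_output", "metrics_output")


@dataclass(frozen=True)
class CleaningConfig:
    raw_input: str        # path to the raw data file
    steps: List[str]      # cleaning steps, in the order they should run
    clean_output: str     # where the cleaned data file is written
    metrics_output: str   # where the characterization metrics are written


def parse_config(text: str) -> CleaningConfig:
    """Parse and validate a JSON config string into a CleaningConfig."""
    data = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"config is missing keys: {missing}")
    return CleaningConfig(
        raw_input=data["raw_input"],
        steps=list(data["steps"]),
        clean_output=data["clean_output"],
        metrics_output=data["metrics_output"],
    )


if __name__ == "__main__":
    print(parse_config(EXAMPLE_CONFIG))
```

With a file like this, producing a second cleaned product from the same raw data means copying the config, editing `steps` and the output paths, and rerunning the cleaner, while the backtesting software stays free of filtering logic.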