FWIW, I can still download a few thousand EOD symbols from Yahoo Finance without any problem. I have a set of about 1900 ETFs (csv/zoo format). But from casual inspection, there are lots of gaps and outliers throughout.
I think it is cleanable, and I can imagine a simple script that adjusts for improper splits/gaps by back-stitching over at least the large gaps (similar to stitching futures contracts). I think outliers could also be reasonably cleaned by an automated script. One time-consuming aspect would be visually verifying, validating, and troubleshooting the results against some reliable dataset or data provider. I have data-gathering scripts written in R and could possibly work on it in other languages.
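To make the idea concrete, here is a rough Python sketch of what I mean, not a finished cleaner. The function names (`clean_spikes`, `back_stitch`) and the thresholds (`tol=0.2`, `max_jump=0.4`) are placeholders I made up for illustration; they would need tuning per symbol and checking against a trusted source. It assumes a plain list of positive daily closes, oldest first.

```python
def clean_spikes(closes, tol=0.2):
    """Replace one-day spikes that immediately revert.

    A point counts as a spike when it differs from BOTH neighbours by more
    than `tol` (fractional) while the neighbours agree with each other.
    Level shifts (e.g. an unadjusted split) are left alone.
    `tol` is an assumed, untuned threshold.
    """
    cleaned = list(closes)
    for i in range(1, len(closes) - 1):
        prev, cur, nxt = closes[i - 1], closes[i], closes[i + 1]
        neighbours_agree = abs(nxt / prev - 1.0) <= tol
        is_spike = abs(cur / prev - 1.0) > tol and abs(cur / nxt - 1.0) > tol
        if neighbours_agree and is_spike:
            cleaned[i] = (prev + nxt) / 2.0  # simple interpolation
    return cleaned


def back_stitch(closes, max_jump=0.4):
    """Ratio back-adjust history across large day-to-day jumps.

    Walks from the newest bar to the oldest. Whenever the close-to-close
    ratio deviates from 1 by more than `max_jump`, all earlier prices are
    scaled by that ratio, the way futures contracts are ratio-stitched.
    Caveat: a genuine move larger than `max_jump` is flattened as well.
    """
    adjusted = list(closes)
    factor = 1.0
    for i in range(len(closes) - 1, 0, -1):
        ratio = closes[i] / closes[i - 1]
        if abs(ratio - 1.0) > max_jump:
            factor *= ratio
        adjusted[i - 1] = closes[i - 1] * factor
    return adjusted


if __name__ == "__main__":
    # Made-up series: a bad tick (300) followed by an unadjusted 2:1 split.
    raw = [100.0, 300.0, 102.0, 51.0, 52.0]
    print(back_stitch(clean_spikes(raw)))
    # prints [50.0, 50.5, 51.0, 51.0, 52.0]
```

I run the spike filter first so that a single bad tick is not mistaken for two back-to-back splits. Note that the return across each stitched gap becomes zero, so any flagged date would still need to be checked by eye against a reliable provider.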
If anyone is interested in tackling it with a divide-and-conquer approach (at least for, say, the largest 500 ETFs), let me know.
At the very least, I can send you a small set of csv files to work on. Who knows how long the data will remain available. This wouldn't address ongoing data, but at least you'd hopefully have a cleaner dataset to backtest over.
I've seen similar issues with Quandl data, so I think a good cleaning script would always be useful. And as others have pointed out, I don't see a lot of historical ETF data providers out there.