Garbage data at Polygon.io?

Thanks for having a look at it. Once you find the bug, I would like to know whether it affects your entire database and therefore requires me to re-download all the data. Luckily my ISP doesn't have a data cap. Meanwhile I'll probably write some kind of data integrity check to estimate how much of my downloaded ~200GB JSON database is corrupted.

It shouldn't affect the entire dataset, however we will certainly validate that. We'll also be able to indicate which points in time were impacted so you don't have to re-download everything.
 
Any luck finding the bug?

For my ~5y database of ~1000 stocks I have downloaded from Polygon, I ran a data validation which checked if the wick was >5% of the price, and there were a lot of days which exceeded even this high 5% threshold. They are distributed across all the stocks and throughout the 5y period. I attached a list of dates & stocks to this post that have those >5% spikes.
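A check along these lines could be sketched as follows. This is a minimal Python sketch, not the actual script used for the attached list: it assumes each bar is a dict with the `o`, `h`, `l`, `c` keys found in Polygon's aggregate JSON, and it measures the wick relative to the close, which is only one of several reasonable definitions. The names `wick_fraction` and `find_spikes` are made up for illustration.

```python
def wick_fraction(bar):
    """Return the larger of the upper and lower wick, as a fraction of the close."""
    body_top = max(bar["o"], bar["c"])
    body_bottom = min(bar["o"], bar["c"])
    upper_wick = bar["h"] - body_top
    lower_wick = body_bottom - bar["l"]
    return max(upper_wick, lower_wick) / bar["c"]


def find_spikes(bars, threshold=0.05):
    """Return the bars whose largest wick exceeds `threshold` (0.05 = 5%)."""
    return [bar for bar in bars if wick_fraction(bar) > threshold]
```

Note that a fixed threshold is a blunt tool: a bar whose high sits roughly 3.5% above its close would pass the 5% check untouched, so a lower threshold may be needed for short intraday bars.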
 


In this particular case, we received two trade messages: one with a Prior Reference Price (PRP) and Trade Thru condition, and another with no conditions attached. It's important to note that we exclude PRP or Trade Thru trades from our aggregates. Therefore, it's the second message that has affected the candlestick data.
Looks like the second trade, the one that's causing the trouble and is missing its condition, is a "Cancelled record".

In addition to the fix, it would be good to have an option to exclude dark pool trades from the aggregates. Dark pool trades are not included in real-time candles, so algos behave differently in real-time vs backtesting. After the fix this may not have a big impact on the price data, but it can have quite an impact on volume, and algos may use the volume data. If there's also a discrepancy in price, it makes the situation worse (e.g. dark pool trades could seemingly trigger stop/limit orders in a backtest). You could argue that including dark pool data gives "100% market coverage", but this coverage isn't available when trading in real time, so it's not in fact the coverage you want.
 
Got it resolved by downloading the trade data and generating the aggregates myself instead of downloading the aggregates. It took about a week of downloading 24/7 to fetch 1.2TB of zipped data, plus R&D on how to generate aggregates from trade data using trade conditions and corrections :p
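For anyone curious what that looks like in principle, here is a rough Python sketch of building bars from trades while skipping trades that carry unwanted conditions. Everything in it is an assumption for illustration: the trade field names (`ts_ms`, `price`, `size`, `conditions`), the function name `build_bars`, and the contents of `excluded_conditions`, which you would have to fill in from your provider's condition code mapping. It also ignores corrections and the finer rules about which conditions may update high/low versus volume.

```python
def build_bars(trades, bar_ms=5000, excluded_conditions=frozenset()):
    """Aggregate trades into OHLCV bars of `bar_ms` milliseconds.

    Trades carrying any condition code in `excluded_conditions` are skipped
    entirely (hypothetical codes; consult your provider's mapping).
    """
    excluded = frozenset(excluded_conditions)
    bars = {}
    for trade in sorted(trades, key=lambda t: t["ts_ms"]):
        if excluded.intersection(trade.get("conditions") or ()):
            continue  # skip late, cancelled or otherwise non-eligible prints
        start = trade["ts_ms"] - trade["ts_ms"] % bar_ms  # bar start time
        price, size = trade["price"], trade["size"]
        bar = bars.get(start)
        if bar is None:
            bars[start] = {"t": start, "o": price, "h": price,
                           "l": price, "c": price, "v": size, "n": 1}
        else:
            bar["h"] = max(bar["h"], price)
            bar["l"] = min(bar["l"], price)
            bar["c"] = price
            bar["v"] += size
            bar["n"] += 1
    return [bars[start] for start in sorted(bars)]
```

The same skip test is where a dark pool filter could go, for example by also checking an exchange or reporting-facility field on each trade, if your feed provides one.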
 
Unfortunately I can confirm that several spikes in the Polygon data still persist today.
1. TSLA on 2024-02-05 (as BlackPhoenix mentioned originally).
2. PLTR on 2024-02-06 (as he listed in the file polygon_spikes.txt).
3. NVDA on 2024-03-11 (a newer example I picked at random).

TSLA_2024-02-05.png

PLTR_2024-02-06.png

NVDA_2024-03-11.png


I don't think downloading the trade data and generating the aggregates oneself is a solution; it is a workaround.
1. It makes the aggregates endpoint superfluous, at least for intraday data.
2. I would have to download and store much more data that I do not need.
3. I would have to upgrade from the Starter plan to the Developer plan and pay $40 more per month (on an annual basis), although I don't need trades.

So can someone from Polygon.io tell us whether they intend to fix the spikes in the aggregates endpoint?
 
100% agree, and I had to do exactly that: upgrade to the Developer plan to be able to download the data. All these price spikes currently make the data completely unusable for algo trading. The good thing about the upgraded plan, though, is that I can now access 10y of data instead of 5y and I can also better filter trades to match the real-time data from my broker. E.g. Polygon takes pride in having 100% market coverage, but this is pointless if the real-time data from your broker doesn't cover 100%, for example because it excludes dark pools.

When I chatted with Polygon support (I think it was a couple of weeks ago) they told me that this is one of their high-priority bugs to fix, but there's no ETA. That being said, I reviewed a long list of data provider offerings, and Polygon has a very good deal for retail traders and fast downloads. I did realize that I should upgrade to the "Advanced" plan to get access to quotes to better estimate fill prices, but I'm not quite ready to invest $199/mo in the data ;)
 
I started using Polygon.io for my historical data, specifically their 5sec aggregates. I hadn't paid much attention to the data quality until now, when I noticed crazy high/low spikes that appear all over the place in the data, which is obviously a problem for backtesting.

Below is an example of TSLA (Feb 5th, 2024) for Polygon:
View attachment 334204

The left-most spike is at 11:44:55. When comparing to IBKR 5sec bars in TWS I don't see such spikes:
View attachment 334206

Apart from the spikes, the price data matches pretty closely between the two. I first thought it might be an issue in my code that parses the Polygon data, etc., but when I examined the raw data from Polygon I saw this (for the left-most spike):
Code:
{"v":32874,"vw":177.9352,"o":177.8991,"c":177.925,"h":177.97,"l":177.8801,"t":1707151485000,"n":308},
{"v":23869,"vw":177.9224,"o":177.91,"c":177.94,"h":177.94,"l":177.9,"t":1707151490000,"n":155},

{"v":17460,"vw":178.0978,"o":177.9301,"c":177.9315,"h":184.14,"l":177.9194,"t":1707151495000,"n":174},

{"v":16172,"vw":177.9604,"o":177.934,"c":177.9564,"h":177.98,"l":177.934,"t":1707151500000,"n":225},
{"v":19644,"vw":177.9671,"o":177.9575,"c":177.96,"h":177.99,"l":177.95,"t":1707151505000,"n":222},
As you can see the high-value for timestamp "1707151495000" is over 6 points higher (184.14) than the surrounding high-values, so it's definitely an issue in the Polygon data stream.
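The same comparison against the surrounding bars can be automated. Below is a minimal Python sketch, assuming bars shaped like the raw JSON above (only `h` and `t` are used); the function name `flag_outlier_highs`, the window of 2 bars on each side and the 1% tolerance are arbitrary choices for illustration. It looks at future bars too, so it is only suitable for offline validation, not for real-time filtering.

```python
from statistics import median


def flag_outlier_highs(bars, window=2, tolerance=0.01):
    """Return timestamps of bars whose high exceeds the median high of
    up to `window` neighbouring bars on each side by more than `tolerance`."""
    flagged = []
    for i, bar in enumerate(bars):
        neighbours = bars[max(0, i - window):i] + bars[i + 1:i + 1 + window]
        if not neighbours:
            continue
        reference = median(b["h"] for b in neighbours)
        if bar["h"] > reference * (1 + tolerance):
            flagged.append(bar["t"])
    return flagged


# The five bars from the raw data above, reduced to the fields used here.
sample = [
    {"h": 177.97, "t": 1707151485000},
    {"h": 177.94, "t": 1707151490000},
    {"h": 184.14, "t": 1707151495000},
    {"h": 177.98, "t": 1707151500000},
    {"h": 177.99, "t": 1707151505000},
]
print(flag_outlier_highs(sample))  # [1707151495000]
```

A mirrored check on `l` would catch downward spikes the same way.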

This is not an isolated incident: I see the same spikes in pretty much every stock on the various days I have checked. They are all over the place.

So, is IBKR filtering the spikes somehow and these are real spikes in the price data, or is Polygon.io data corrupted?

Oh, the joy of historical data.
These spikes are trades that you could not have participated in. Think of firms trading among themselves. These trades are not necessarily reported in real-time, but when reported, they are inserted into the, now historical, data.
Tick data does not have this problem in that there are flags you can filter on (or check the total volume) to ignore the late or non-tradeable trade. True tick data is best. You can get true tick data by capturing the data from your real-time feed or by subscribing to a data feed like NxCore. Aggregate the ticks yourself as needed. Most historical tick data does not have much of the information like bid, ask, bid size, etc.
It would be nice if your data feed had the option of removing the out-of-order data from the historical data, but you can write a filter yourself if you like.
 
I have a system, that I sometimes run, that trades in realtime during the day, out at the close.
I can backtest at the end of the day using daily data to see any differences in trades.
I run this system on about 1,000 of the most liquid stock symbols.
I periodically see a backtest trade that did not happen in realtime. I check the daily data and sure enough, there is a big spike. I download the tick data, and it was never even close, no spike. I download DTN, Yahoo EOD data, and someone else's data (I don't remember who), and big spike.
The takeaway is that you trade on realtime tick data. If there are false spikes in your realtime data, you need to figure out how to filter them out in realtime. If you are backtesting on anything other than tick data, you may need to figure out a way to check whether the data is accurate. Those false spikes will skew your backtest results, making your system look way better than it actually is.
In answer to your question, realtime data feeds have flags so you can filter out these rogue trades.
 