How to explain the most variance in an ETF?

TheBigShort · May 10, 2021

Hi everyone,

I am trying to replicate an ETF using as few components as possible. There are really two parts to my question: 1) how to explain the most variance, and 2) how to size the positions.*

The idea is that, if I am trying to replicate QQQ, buying the top 5 holdings may not be the best bet, since AAPL is very similar to MSFT. There might also be a stock way down the list, with a vol of 100%, that is not correlated with AAPL, MSFT, etc. and explains some of the remaining variance in QQQ.

I was thinking of doing a PCA regression - grab all components of QQQ and see which loadings I should use. The issue is, there are too many loadings! So the problem is still not solved.

Does anyone have ideas or links for modern-day dispersion trading?

*Note: I am looking at this through a vol lens, not D1.

For a case study, I have attached a data.frame for QQQ and the components over the last 2 years. I also naively zeroed out large outliers in the dataset.

Thank you for your time

Code:

# A tibble: 6 x 101
       QQQ     AAPL     MSFT     AMZN     TSLA     GOOG       FB    GOOGL     NVDA     PYPL    CMCSA     INTC
     <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
1 -0.00434 -6.51e-3 -1.31e-2 -0.00560  0.0431  -0.00468 -0.00259 -0.00580  1.51e-2 -0.0112  -1.77e-2 -0.00414
2  0.0159   1.24e-2  2.13e-2  0.0324   0.0448   0.0196   0.0153   0.0198  -9.82e-4  0.0206   1.50e-2  0.0237
3 -0.00612 -1.54e-2 -5.82e-3 -0.00607  0.00122  0.00337 -0.00813  0.00329 -1.73e-2 -0.00991  2.31e-4 -0.00418
4 -0.0195  -2.70e-2 -2.05e-2 -0.0151  -0.0324  -0.0129  -0.0212  -0.0122  -3.75e-2 -0.0171  -1.25e-2 -0.0144
5 -0.00252  1.97e-4 -7.95e-5 -0.00168 -0.00899 -0.00667 -0.00121 -0.00685  4.68e-3  0.00110 -4.91e-3 -0.0246
6 -0.00538 -1.07e-2 -7.96e-5 -0.00933 -0.0117  -0.00334 -0.00470 -0.00240 -2.14e-2  0.00605  8.70e-3 -0.0532

Kevin Schmit · May 11, 2021

TheBigShort said:
I am trying to replicate an ETF using the least amount of components as possible... I am looking at this through a vol lens not D1.

The easiest way is:

Group your 100 names into 5 non-overlapping factors, sectors, or clusters. Just use kmeans or densclust or really any clustering method. Discard all but a few names in each cluster (for example, keep the top four in each cluster by semi-partial correlation with the index). Fit an OLS regression of the index on every combination of one name from each cluster, and choose the combination with the lowest w'Aw, where A is the 5x5 correlation matrix of the five chosen names and w is the vector of their five weights. This metric is a similarity-adjusted Herfindahl stat. Strictly speaking, A should be a similarity matrix with all entries between zero and one, but since the negative entries in your correlation matrix are few and small, you can just use the correlation matrix.
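The selection step above can be sketched in Python as follows. This is only a rough illustration under several assumptions: the clustering and shortlisting are taken as already done (the `shortlists` argument is a hypothetical input holding the column indices kept in each cluster), the OLS fit has no intercept, the function names are made up for this example, and the data is synthetic rather than the attached QQQ returns.

```python
import itertools
import numpy as np


def adjusted_herfindahl(w, A):
    """Similarity-adjusted Herfindahl statistic w'Aw."""
    w = np.asarray(w, dtype=float)
    A = np.atleast_2d(np.asarray(A, dtype=float))
    return float(w @ A @ w)


def pick_sparse_replicator(index_ret, comp_ret, shortlists):
    """Try one name per cluster and keep the combination with the lowest w'Aw.

    index_ret  : (T,) array of index returns
    comp_ret   : (T, N) array of component returns
    shortlists : one list of column indices per cluster (hypothetical input,
                 i.e. the few names kept in each cluster after clustering)
    """
    best = None
    for combo in itertools.product(*shortlists):
        X = comp_ret[:, list(combo)]
        # OLS weights without an intercept (an assumption of this sketch).
        w, *_ = np.linalg.lstsq(X, index_ret, rcond=None)
        # Correlation matrix of the chosen names stands in for the similarity matrix A.
        A = np.corrcoef(X, rowvar=False)
        score = adjusted_herfindahl(w, A)
        if best is None or score < best[0]:
            best = (score, combo, w)
    return best


# Synthetic demo: six made-up "stocks" in three clusters of two.
rng = np.random.default_rng(0)
comp_ret = rng.normal(0.0, 0.02, size=(500, 6))
index_ret = comp_ret.mean(axis=1)  # toy equal-weighted "index"
shortlists = [[0, 1], [2, 3], [4, 5]]

score, combo, w = pick_sparse_replicator(index_ret, comp_ret, shortlists)
print("chosen columns:", combo)
print("OLS weights:", np.round(w, 3))
print("w'Aw:", round(score, 4))
```

With 5 clusters and four candidates per cluster this loop runs 4^5 = 1024 small regressions, so brute force is cheap. Note that the OLS weights double as a first answer to the sizing question.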

That is the simple way. The complex way is to spend a considerable amount of time and effort estimating the forward or instantaneous covariance matrix, then run a lasso with an additional penalty term involving the adjusted Herfindahl stat.
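One possible way to write that objective down is given here. The notation is an assumption made for illustration, not something specified in the post: \Sigma is the estimated covariance matrix of the component returns, c is the vector of covariances between each component and the index, \sigma_I^2 is the index variance, A is the similarity (or correlation) matrix, and \lambda_1, \lambda_2 are tuning parameters.

```latex
\hat{w} \;=\; \arg\min_{w}\;
\underbrace{\sigma_I^2 \;-\; 2\,w^\top c \;+\; w^\top \Sigma\, w}_{\operatorname{Var}(r_I - w^\top r)}
\;+\; \lambda_1 \lVert w \rVert_1
\;+\; \lambda_2\, w^\top A\, w
```

The first three terms are the variance of the tracking error, the \lambda_1 term is the usual lasso penalty that pushes most weights to zero, and the \lambda_2 term penalizes portfolios whose surviving names are highly similar to one another.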

This is an interesting question. I am surprised no one has answered it yet. It is interesting not just for dispersion trading, but also because an index and its sparse replicating portfolio often make excellent pairs trades. And there is some evidence that sparse replicators with low pairwise correlation (small Herfindahl stat) are more stably cointegrated with the index.

betcashrun · May 11, 2021

Correlation is about past performance, and past performance is not an indication of future results.

The simpler the better:
- the five biggest companies (represent 41% of the index - replacing GOOGL by GOOG)
- the ten biggest companies (represent 54% of the index)
...
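To compare a fixed-weight basket like these against any other candidate, one rough check is the share of the index's variance that the basket captures. The sketch below assumes you already have a basket weight vector (for example index weights of the top names); the function name, the weights, and the data are all made up for illustration.

```python
import numpy as np


def variance_explained(index_ret, comp_ret, weights):
    """1 - Var(index - basket) / Var(index) for a fixed-weight basket."""
    index_ret = np.asarray(index_ret, dtype=float)
    basket = np.asarray(comp_ret, dtype=float) @ np.asarray(weights, dtype=float)
    resid = index_ret - basket
    return 1.0 - resid.var() / index_ret.var()


# Synthetic demo: a toy 4-name "index" and a basket holding only the two largest names.
rng = np.random.default_rng(1)
demo_ret = rng.normal(0.0, 0.02, size=(500, 4))
demo_index_w = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical index weights
demo_index = demo_ret @ demo_index_w
demo_basket_w = np.array([0.4, 0.3, 0.0, 0.0])  # keep the top two, drop the rest

print(round(variance_explained(demo_index, demo_ret, demo_basket_w), 3))
```

The same function can score the five-name portfolio from a clustering approach, so both baskets are judged on one number over the same window.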
