Appreciate any other books & papers you think would be beneficial.
I have found the two attached papers to be very useful. In the regression context, the upshot is that you are likely to be much more successful when p >> n (that is, when your design matrix has many more columns than rows). Your worst out-of-sample results are likely to occur where p is close to n. This is particularly important in time-varying parameter/coefficient regression, where the number of effective observations in each individual lookback kernel is small and the pool of potential predictor variables is large. (This goes for state-space estimates as well as explicit half-kernel coefficient estimates.) Essentially, Kelly et al. show that Fisher's infinitesimal model applies to market forecasting as well as to genetics (omnigenics, "benign overfitting"). Two of the authors are associated with AQR, and Kelly has an impressive-sounding title there (as well as his Yale association).
Moving from overdetermined to underdetermined regression will likely raise out-of-sample R^2, in some cases as much as doubling it. This may be the single easiest thing an ET basement quant can do to improve his forecast metrics.
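To make the p-versus-n point concrete, here is a minimal simulation sketch. It is not taken from either paper. It assumes a toy "infinitesimal" DGP (1000 weak Gaussian predictors with small i.i.d. coefficients, plus noise), fits least squares on the first p columns only, and reports out-of-sample R^2 measured against a zero forecast. The function names, sample sizes and noise level are all illustrative assumptions. numpy's lstsq gives the ordinary OLS fit when p < n and the minimum-norm ("ridgeless") interpolating fit when p > n.

```python
import numpy as np


def oos_r2(y_true, y_pred):
    """Out-of-sample R^2 against a zero forecast (no demeaning)."""
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum(y_true ** 2)


def complexity_curve(n_train=50, n_test=2000, d_total=1000,
                     p_grid=(5, 25, 50, 100, 400, 1000),
                     noise_sd=0.5, n_trials=5, seed=0):
    """Average OOS R^2 of least squares on the first p of d_total predictors.

    Assumed toy DGP: y = X @ beta + noise, with many tiny coefficients.
    """
    rng = np.random.default_rng(seed)
    r2 = np.zeros((n_trials, len(p_grid)))
    for t in range(n_trials):
        # Many weak effects: total signal variance is roughly 1.
        beta = rng.standard_normal(d_total) / np.sqrt(d_total)
        X_tr = rng.standard_normal((n_train, d_total))
        X_te = rng.standard_normal((n_test, d_total))
        y_tr = X_tr @ beta + noise_sd * rng.standard_normal(n_train)
        y_te = X_te @ beta + noise_sd * rng.standard_normal(n_test)
        for j, p in enumerate(p_grid):
            # OLS for p < n; minimum-norm interpolator for p > n.
            b_hat, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
            r2[t, j] = oos_r2(y_te, X_te[:, :p] @ b_hat)
    return dict(zip(p_grid, r2.mean(axis=0)))


if __name__ == "__main__":
    for p, r2 in complexity_curve().items():
        print(f"p = {p:5d}   OOS R^2 = {r2:10.3f}")
```

In this toy setup the R^2 collapses at p = n and recovers as p grows past n. How much you gain in the p >> n regime depends on the signal-to-noise ratio and the feature structure, so treat the numbers as illustrative only.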
From Virtue of Complexity in Return Prediction:
"We prove that, in the high-complexity regime (P >> N), expected out-of-sample forecast accuracy and portfolio performance are strictly increasing in model complexity. The analyst should always use the largest approximating model that he can compute. Applying optimal shrinkage to large P models enhances performance further (indeed, we derive the choice of shrinkage that maximizes expected out-of-sample model performance). In other words, when the true data- generating process (DGP) is unknown, the approximation gains achieved through model complexity dominate the statistical costs of heavy parameterization."
Also, from footnote 5:
"Relatedly, Rapach et al (2010) show that MSE decomposes into a scale-free (correlation) component and a scale-dependent component. It is the scale- free component that is important for trading strategy performance."
Note: the optimal shrinkage parameter referenced in the quote above (e.g. the ridge lambda) will often be slightly negative in the p >> n regime.
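A negative lambda is perfectly computable when p > n, as long as it stays above minus the smallest eigenvalue of X X'. The sketch below is my own illustration, not the papers' procedure: it solves ridge through the n x n dual system and includes a hypothetical helper, pick_lambda, that scans a grid (negative values allowed) on a validation set. It does not by itself show that a negative value is optimal; that depends on your data.

```python
import numpy as np


def ridge_dual(X, y, lam):
    """Ridge coefficients via the dual form: beta = X' (X X' + lam I)^-1 y.

    Intended for p > n. lam may be negative, provided X X' + lam I
    stays positive definite.
    """
    n = X.shape[0]
    gram = X @ X.T
    lam_floor = -np.linalg.eigvalsh(gram)[0]  # eigvalsh sorts ascending
    if lam <= lam_floor:
        raise ValueError(f"lam must exceed {lam_floor:.4g} for this X")
    return X.T @ np.linalg.solve(gram + lam * np.eye(n), y)


def pick_lambda(X_tr, y_tr, X_val, y_val, lam_grid):
    """Hypothetical helper: grid value with the lowest validation MSE."""
    scores = {}
    for lam in lam_grid:
        beta = ridge_dual(X_tr, y_tr, lam)
        scores[lam] = float(np.mean((y_val - X_val @ beta) ** 2))
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 30, 300
    beta_true = rng.standard_normal(p) / np.sqrt(p)
    X_tr, X_val = rng.standard_normal((n, p)), rng.standard_normal((500, p))
    y_tr = X_tr @ beta_true + 0.5 * rng.standard_normal(n)
    y_val = X_val @ beta_true + 0.5 * rng.standard_normal(500)
    best, scores = pick_lambda(X_tr, y_tr, X_val, y_val,
                               [-20.0, -5.0, 0.0, 5.0, 50.0, 500.0])
    print("best lambda on this grid:", best)
```

Setting lam = 0 recovers the minimum-norm interpolator, and for positive lam the result matches the usual primal formula (X'X + lam I)^-1 X'y. Off-the-shelf ridge routines may reject negative penalties, which is one reason to write the dual solve by hand.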