Is Walk-Forward (out of sample) testing simply an illusion?

Suppose there are three periods A,B,C
SegA runs 1.0→2.0
SegB runs 2.0→1.0, and
SegC runs 1.0→1.0

Models performing best in
SegA would show positive (Cartesian) linear behavior,
SegB would show negative linear behavior, while
SegC would exhibit a graceful curvilinear arc and do best with a negative second-order parameter that degrades (and eventually overwhelms) a positive first-order parameter.

So....
1) The border matters.
2) W.R.T. subset performance, SegA+SegB =/= SegC.
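
A minimal sketch of the three-segment example, for anyone who wants to check the signs. It assumes (my reading, not stated in the post) that SegC is simply SegA followed by SegB on one shared x-axis running from 0 to 2, and that each segment is a straight line. The helper `polyfit` and all variable names are made up for illustration; it is a plain least-squares fit via the normal equations.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (small problems only)."""
    n = degree + 1
    # Normal equations A c = b, with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs  # [constant, first-order, second-order, ...]


steps = 50  # assumed sampling density per segment
xs_a = [i / steps for i in range(steps + 1)]            # SegA: x in [0, 1]
ys_a = [1.0 + x for x in xs_a]                          # runs 1.0 -> 2.0
xs_b = [1.0 + i / steps for i in range(1, steps + 1)]   # SegB: x in (1, 2]
ys_b = [3.0 - x for x in xs_b]                          # runs 2.0 -> 1.0
xs_c, ys_c = xs_a + xs_b, ys_a + ys_b                   # SegC: 1.0 -> 2.0 -> 1.0

slope_a = polyfit(xs_a, ys_a, 1)[1]   # first-order term on SegA
slope_b = polyfit(xs_b, ys_b, 1)[1]   # first-order term on SegB
slope_c = polyfit(xs_c, ys_c, 1)[1]   # a straight line on SegC
_, first_c, second_c = polyfit(xs_c, ys_c, 2)  # quadratic on SegC

print("SegA slope:", round(slope_a, 4))
print("SegB slope:", round(slope_b, 4))
print("SegC linear slope:", round(slope_c, 4))
print("SegC quadratic terms:", round(first_c, 4), round(second_c, 4))
```

Under these assumptions the best straight line on SegC is flat, so neither the SegA winner nor the SegB winner carries over. The quadratic fit is the one with a positive first-order term and a negative second-order term.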
 
You optimize your strategy on segment 1 and then test on segment 2. You are not allowed to optimize using data from segment 2. That is the difference between the two situations that you are describing.

The end result is the same though! By keeping only models that look pretty on a segment 2 test we have in fact manually optimized the system to segment 2. We might as well optimize on the whole segment 3. The end selection of systems will be the same.
 
Suppose we have 1000 models. Two classrooms in two different rooms get the same models and the same data.

The first classroom is full of geniuses who know wassup and will test the "clever way", meaning they test all models on seg1, then test the pretty ones on seg2 and keep the ones that test pretty on seg2. That is their end selection.

The second classroom is full of idiots and they test the "dumb way". They run all models on the combined seg3 which is a simple continuous combo of seg1 and seg2 and keep the pretty ones for their end selection.

Question: don't you think the classrooms will end up with the same end selection of models? ;)

Bonus question: don't you think at least some of the models in the end selection are there by random chance because we tested a lot of models and contain no alpha?
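
The bonus question can be poked at with a rough simulation. This is only a sketch under stated assumptions: every model's returns are pure Gaussian noise with zero mean (so no model has alpha), "pretty" is taken to mean a per-period Sharpe ratio above an arbitrary threshold, and the names `sharpe`, `simulate`, `threshold` and `seg_len` are invented for the example.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe ratio (mean / sample std dev), not annualized."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def simulate(n_models=1000, seg_len=250, threshold=0.05, seed=0):
    """Return the model ids kept by each classroom when no model has any edge."""
    rng = random.Random(seed)
    clever, dumb = set(), set()
    for model_id in range(n_models):
        # Zero-mean noise: by construction this model contains no alpha.
        seg1 = [rng.gauss(0.0, 0.01) for _ in range(seg_len)]
        seg2 = [rng.gauss(0.0, 0.01) for _ in range(seg_len)]
        # "Clever way": must look pretty on seg1, then again on seg2.
        if sharpe(seg1) > threshold and sharpe(seg2) > threshold:
            clever.add(model_id)
        # "Dumb way": must look pretty on the combined seg3.
        if sharpe(seg1 + seg2) > threshold:
            dumb.add(model_id)
    return clever, dumb


clever, dumb = simulate()
print("kept by clever classroom:", len(clever))
print("kept by dumb classroom:", len(dumb))
print("kept by both:", len(clever & dumb))
```

Every model that survives here survives by luck alone, since none has an edge by construction. Changing `threshold` or `seg_len` changes how many get through, but the criterion itself is an assumption of this sketch, not something defined in the thread.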


It's very easy to show that both classrooms will not necessarily end up with the same selection of models. All you need to do is imagine that model_n did very poorly on the first segment and fantastically well on the second segment. So well, in fact, that model_n's performance on the combined set was better than that of all the other models. If the other classroom only looks at the entire concatenated set of data, they will choose model_n. Class one, however, already threw out that model in the first-segment selection step, which filters the models that pass to segment 2.
Hopefully you can see that the sets of models at the final step are not equivalent.
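
Here is a toy version of that counterexample. The return numbers are made up purely for illustration, and the selection rules are assumptions: class one keeps only models whose wealth grew on segment 1 and then picks the best of those on segment 2, while class two picks the best terminal wealth on the concatenated data.

```python
def terminal_wealth(returns, start=1.0):
    """Compound a list of per-period returns into final wealth."""
    wealth = start
    for r in returns:
        wealth *= 1.0 + r
    return wealth


# Hypothetical per-period returns; model_n is poor on seg1 and excellent on seg2.
models = {
    "model_a": {"seg1": [0.02, 0.01, 0.02], "seg2": [0.01, 0.02, 0.01]},
    "model_b": {"seg1": [0.01, 0.01, 0.01], "seg2": [0.01, 0.00, 0.01]},
    "model_n": {"seg1": [-0.05, -0.04, -0.05], "seg2": [0.20, 0.25, 0.20]},
}

# Class one: filter on seg1 first, then rank the survivors on seg2.
survivors = [
    name for name, segs in models.items()
    if terminal_wealth(segs["seg1"]) > 1.0
]
class_one_pick = max(survivors, key=lambda n: terminal_wealth(models[n]["seg2"]))

# Class two: rank every model on the concatenated seg1 + seg2.
class_two_pick = max(
    models,
    key=lambda n: terminal_wealth(models[n]["seg1"] + models[n]["seg2"]),
)

print("class one keeps:", class_one_pick)
print("class two keeps:", class_two_pick)
```

With these numbers model_n never reaches segment 2 in class one, yet it has the highest terminal wealth on the combined data, so the two classes end up with different picks.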

But that point is sort of trivial. What's more important is that both methods of filtering the best of the distributions suffer from the same selection bias. With two segments, though, you can devise a different way to choose models, one that gives you a better likelihood of performing well on a new, unseen segment. You cannot do that with only one combined segment.

If you really want to understand it from a more modern statistical perspective, look into the bias/variance tradeoff. You will find very little literature applying it to this use case, though; that part is up to you to figure out.
 
Suppose there are three periods A,B,C
SegA runs 1.0→2.0
SegB runs 2.0→1.0, and
SegC runs 1.0→1.0

Models performing best in
SegA would show positive (Cartesian) linear behavior,
SegB would show negative linear behavior, while
SegC would exhibit a graceful curvilinear arc and do best with a negative second-order parameter that degrades (and eventually overwhelms) a positive first-order parameter.

So....
1) The border matters.
2) W.R.T. subset performance, SegA+SegB =/= SegC.

You guys are failing to consider the x-axis:
[image: Sk65hccT-.png]


You guys are trying to apply the model as though the domains are equal for all three segments.

I agree with @pursuit

Hopefully, I don't have to expound. :)
 
It's very easy to show that both classrooms will not necessarily end up with the same selection of models. All you need to do is imagine that model_n did very poorly on the first segment and fantastically well on the second segment. So well, in fact, that model_n's performance on the combined set was better than that of all the other models. If the other classroom only looks at the entire concatenated set of data, they will choose model_n. Class one, however, already threw out that model in the first-segment selection step, which filters the models that pass to segment 2.

No, model_n will not be selected by the class testing on segment 3. They will see that it did shitty on segment 1 part of segment 3 and will discard it. They're looking for a pretty graph throughout segment 3. That is impossible if segment 1 performed horribly.
 
No, model_n will not be selected by the class testing on segment 3. They will see that it did shitty on segment 1 part of segment 3 and will discard it. They're looking for a pretty graph throughout segment 3. That is impossible if segment 1 performed horribly.

That really depends on how you define what 'shitty' or 'good' is. Notice you never did define it. 'Looking' at a graph characteristic in segments, is not the same as using a single quantitative metric to determine the outcome(s). In my case I simply used terminal wealth as a proxy (which isn't that uncommon). Step it up to a trillion models, and you really won't be able to eyeball each one and know whether it is good or bad by your reasoning. With terminal wealth as the metric, for example, my scenario shows your hypothesis does not hold in general. If you are going to make a blanket statement, it has to cover all cases; any one case that contradicts it disproves your statement.

Anyways, not here to argue. If you don't get anything from it, no need to post more.
 
The end result is the same though! By keeping only models that look pretty on a segment 2 test we have in fact manually optimized the system to segment 2. We might as well optimize on the whole segment 3. The end selection of systems will be the same.

Yeah, you're probably right.

The main question then would be how long of a segment to use for backtesting.
 
That really depends on how you define what 'shitty' or 'good' is. Notice you never did define it. 'Looking' at a graph characteristic in segments, is not the same as using a single quantitative metric to determine the outcome(s).

We all know what shitty and pretty equity curves look like. There is no issue with quantifying it. We can use Sharpe ratio, Sortino, R-squared or any other accepted measure of smoothness. Does not change the validity of my point one bit.
 
Yeah, you're probably right.

The main question then would be how long of a segment to use for backtesting.

If using just segment 3 is as good as the combo of 1 and 2 then it would be reasonable to use segment 3 of maximum length so our model can experience a wide variety of market conditions in the test.
 
If using just segment 3 is as good as the combo of 1 and 2 then it would be reasonable to use segment 3 of maximum length so our model can experience a wide variety of market conditions in the test.

Your original query is whether using segment 3 was the same as using segment 3 ... after splitting it into two segments (1 and 2), when comparing *pre-built* models. It obviously (to some) is.

This is not the same as saying, in effect, if not in these exact words, "Hey! I should *build* my models on all possible data, and not leave some data out for testing/validating!"
 