Anyway, here is what I personally do, which will hopefully sidestep any arguments that are mainly about definitional disagreements. This works for me because I trade fully systematically with a relatively large number of fairly simple trading rules, and then use the combination of their forecasts.
* select / define some rules
Ideally you wouldn't look at any real performance data at all when doing this (what I would define as 'in sample' backtesting). Otherwise you would be at risk of pre-selecting only rules that were good, which will inflate any performance numbers you look at subsequently (the original post I made).
Rules ideally should have a minimum of parameters, and ideally it should be possible to define sensible 'default' values for these parameters.
So we aren't selecting rules on the basis of performance, but on behaviour. For example, does the rule capture the kind of behaviour I expect, i.e. buy into uptrends, cut in reversals? Is the holding period appropriate, given the length of forecast we're trying to make? Does the rule add some diversification, or is it 99% correlated to another rule? Does the algorithm have corner cases, and how does it handle missing values? That kind of thing.
You can use artificial data to make these kinds of decisions. You can also use 'scenarios', i.e. snapshots of historical events which you run the rule over to check that the behaviour makes sense. If you don't have any choice, and you're disciplined, you can use real data; again, though, you must avoid the temptation to look at p&l.
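As a rough illustration of checking behaviour on artificial data, here is a minimal sketch. Everything in it is an assumption made for the example: the moving average crossover rule, its default parameters, and the synthetic 'trend then reversal' price path are hypothetical stand-ins, not the rules I actually trade. Note that it only inspects the sign of the forecast and the correlation between two variations; it never computes p&l.

```python
import numpy as np

def ma_crossover_forecast(prices, fast=16, slow=64):
    """Hypothetical trend rule: fast minus slow simple moving average.
    Positive means long, negative means short; NaN until `slow` prices exist."""
    prices = np.asarray(prices, dtype=float)
    forecast = np.full(prices.shape, np.nan)
    for t in range(slow - 1, len(prices)):
        fast_ma = prices[t - fast + 1 : t + 1].mean()
        slow_ma = prices[t - slow + 1 : t + 1].mean()
        forecast[t] = fast_ma - slow_ma
    return forecast

def make_trend_then_reversal(n_up=300, n_down=300, drift=0.2, noise=0.2, seed=0):
    """Artificial price path: a noisy uptrend followed by a noisy downtrend."""
    rng = np.random.default_rng(seed)
    steps = np.concatenate([np.full(n_up, drift), np.full(n_down, -drift)])
    steps = steps + rng.normal(0.0, noise, size=steps.size)
    return 100.0 + np.cumsum(steps)

def forecast_correlation(f1, f2):
    """Correlation of two forecast series over the dates where both exist."""
    valid = ~np.isnan(f1) & ~np.isnan(f2)
    return np.corrcoef(f1[valid], f2[valid])[0, 1]

prices = make_trend_then_reversal()
fast_rule = ma_crossover_forecast(prices, fast=16, slow=64)
slow_rule = ma_crossover_forecast(prices, fast=32, slow=128)

# Behaviour checks only: is the rule long late in the uptrend, short late in the downtrend?
late_uptrend = np.nanmean(fast_rule[200:300])
late_downtrend = np.nanmean(fast_rule[500:600])
print("average forecast late in uptrend:  ", late_uptrend)
print("average forecast late in downtrend:", late_downtrend)

# Diversification check: how similar are the two variations?
print("correlation between variations:", forecast_correlation(fast_rule, slow_rule))
```

If the two variations turn out to be almost perfectly correlated on this kind of data, that is a hint the second one adds little diversification.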
Note this doesn't mean you will definitely get realistic, achievable historical performance. You're probably still only testing stuff you know sort of works (there was another thread about this here) from your own experience, from reading the literature, or from reading about someone else's backtest. E.g. we know momentum generally works, we know carry works, we know certain groups of assets display mean-reverting behaviour. This is an implicit overfitting, which we can't get away from unless we use relatively general rules with larger numbers of parameters fitted entirely from the data (a method which has its own pitfalls).
Another, more subtle, reason is that long, unrepeatable secular trends in the past inflate backtested performance.
* decide what weights to choose when combining rules
This is a portfolio allocation problem. We want to upweight good rules, although not by too much unless there is a clear statistical difference in performance. So we use real performance data, but only from the past at each point of the simulation.
I do this on an 'expanding window out of sample' basis (which is what I mean when I say WFA) using non parametric bootstrapping. Any technique that doesn't 'over extrapolate' insignificant differences in performance into radically different allocations is fine. The important thing is we're only using past data to fit with.
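To make the 'expanding window, past data only' idea concrete, here is a minimal sketch. It is not my actual optimiser: as a simplifying assumption, each bootstrap resample sets weights proportional to each rule's positive Sharpe ratio (equal weights if none are positive), and the final weights are the average across resamples. The function names, the refit frequency and the minimum history are all hypothetical choices for the example. Averaging over resamples is one way of stopping insignificant differences in performance turning into radically different allocations.

```python
import numpy as np

def bootstrap_weights(returns, n_boot=200, rng=None):
    """Average, over bootstrap resamples of the rows, of weights proportional
    to each rule's positive Sharpe ratio (a simplified stand-in optimiser)."""
    rng = np.random.default_rng() if rng is None else rng
    n_obs, n_rules = returns.shape
    total = np.zeros(n_rules)
    for _ in range(n_boot):
        idx = rng.integers(0, n_obs, size=n_obs)   # resample dates with replacement
        sample = returns[idx]
        std = sample.std(axis=0)
        safe_std = np.where(std > 0, std, 1.0)     # avoid dividing by zero
        sharpe = np.where(std > 0, sample.mean(axis=0) / safe_std, 0.0)
        score = np.clip(sharpe, 0.0, None)         # no negative weights
        if score.sum() > 0:
            total += score / score.sum()
        else:
            total += np.full(n_rules, 1.0 / n_rules)
    return total / n_boot

def expanding_window_weights(returns, min_history=250, refit_every=250,
                             n_boot=200, seed=0):
    """Weights for each date, refitted periodically using only earlier rows."""
    rng = np.random.default_rng(seed)
    n_obs, n_rules = returns.shape
    weights = np.full((n_obs, n_rules), np.nan)
    current = np.full(n_rules, 1.0 / n_rules)      # equal weights until enough history
    for t in range(n_obs):
        if t >= min_history and (t - min_history) % refit_every == 0:
            # returns[:t] holds rows 0..t-1, so nothing from date t onwards is used
            current = bootstrap_weights(returns[:t], n_boot, rng)
        weights[t] = current
    return weights

# Usage with made-up returns for three rules (purely artificial numbers)
demo_rng = np.random.default_rng(1)
demo_returns = demo_rng.normal(0.0005, 0.01, size=(1000, 3))
demo_weights = expanding_window_weights(demo_returns)
print(demo_weights[[0, 250, 500, 750]])
```

The slice `returns[:t]` is the important part: the weights applied on date `t` are never fitted on data from date `t` or later.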
* backtest or simulate
With the rules, and the dynamically changing allocations, look at the performance. This is purely to get an idea of the likely behaviour of the system, to calibrate our expectations about things like drawdown levels and Sharpe ratio, and hence make sure our risk target isn't too aggressive.
We shouldn't make any changes to rules or parameters at this stage or they will be entirely in sample.
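For the kind of summary numbers mentioned in this step, a minimal sketch might look like the following. The assumptions are mine for the example: returns are per-period and expressed as a fraction of capital, the account curve is additive rather than compounded, there are 256 trading periods in a year, and the function names are hypothetical.

```python
import numpy as np

def portfolio_returns(rule_returns, weights):
    """Combine per-rule returns using the weights in force on each date."""
    return (np.asarray(rule_returns, dtype=float)
            * np.asarray(weights, dtype=float)).sum(axis=1)

def annualised_sharpe(returns, periods_per_year=256):
    """Mean over standard deviation, scaled by sqrt(periods per year)."""
    returns = np.asarray(returns, dtype=float)
    std = returns.std(ddof=1)
    if std == 0:
        return float("nan")
    return float(np.sqrt(periods_per_year) * returns.mean() / std)

def max_drawdown(returns):
    """Largest fall from a previous high of the additive account curve,
    where the curve is taken to start at zero."""
    curve = np.cumsum(np.asarray(returns, dtype=float))
    running_peak = np.maximum(np.maximum.accumulate(curve), 0.0)
    return float((running_peak - curve).max())

# Usage with made-up numbers: two rules held at fixed 50/50 weights
rng = np.random.default_rng(2)
rule_rets = rng.normal(0.0005, 0.01, size=(1000, 2))
fixed_weights = np.full((1000, 2), 0.5)
system_rets = portfolio_returns(rule_rets, fixed_weights)
print("Sharpe ratio:", annualised_sharpe(system_rets))
print("max drawdown:", max_drawdown(system_rets))
```

These numbers are for calibrating expectations only; feeding them back into the choice of rules or parameters would make the whole exercise in sample again.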