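You don't really have to go through all of that, you know. Simply take the backtest sample and split it into the periods where you traded and the periods where you did not (assuming both subsamples have some meat to them), then check whether the strategy results (e.g. the mean) differ significantly between the two. For that you could apply the two-sample t-test or the Wilcoxon rank-sum test; the latter is the nicer choice if normality cannot be assumed, since the t-test leans on normality or on a sample large enough for the central limit theorem to kick in.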
E.g. let's assume your targets are unit (1, 0, -1). If your backtest computes pnl = target * returns, you can build the no-trade subset xnl = returns[target == 0] and test it against the traded subset pnl[target != 0] with a two-sample t-test.
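A minimal sketch of that comparison, assuming NumPy and SciPy are available. The returns and targets here are randomly generated purely for illustration, and the Welch variant of the t-test (equal_var=False) is my choice rather than something the approach requires:

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only; replace with your backtest series.
rng = np.random.default_rng(0)
n = 2000
returns = rng.normal(0.0, 0.01, size=n)   # per-period returns
target = rng.choice([-1, 0, 1], size=n)   # unit targets

pnl = target * returns
traded = pnl[target != 0]                 # strategy pnl when in the market
xnl = returns[target == 0]                # returns when the strategy was flat

# Welch's two-sample t-test (does not assume equal variances).
t_stat, t_p = stats.ttest_ind(traded, xnl, equal_var=False)

# Wilcoxon rank-sum (Mann-Whitney U) test, no normality assumption.
u_stat, u_p = stats.mannwhitneyu(traded, xnl, alternative="two-sided")

print(f"t-test p-value:   {t_p:.4f}")
print(f"rank-sum p-value: {u_p:.4f}")
```

With random targets like these there is no real edge, so large p-values are what you should expect most of the time; a real strategy would need to beat that.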
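PS. Of course, beware of the family-wise error rate if you are doing any sort of "extensive studies": the more strategies or parameter sets you test, the more likely it is that at least one looks significant purely by chance.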
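One common way to keep the family-wise error rate under control is the Holm step-down correction. A rough sketch follows; holm_reject is a hypothetical helper name and the p-values are made up for illustration, not results from any real study:

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: return a boolean mask of rejected hypotheses."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)                 # indices from smallest to largest p
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k).
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break                         # stop at the first failure
    return reject

# Made-up p-values from four hypothetical strategy tests.
p_values = [0.001, 0.04, 0.03, 0.20]
print(holm_reject(p_values))
```

Note that 0.03 and 0.04 would each pass a naive 5% threshold on their own, but neither survives the correction here.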