Quote from smili:
This is my first set of data I'm running through the program. Just seeing how it works and learning the options.
It's daily SPY data back through mid-1993, about 3400 observations in total, broken into thirds for the training, validation, and applied datasets. The applied set is the last third; the training and validation sets are the first two-thirds of the data, randomized. The applied data is not used in any way in forming the model; it is used only to test the model on data it has never seen.
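For illustration, the split described above could be sketched as follows. This is only a rough sketch, not what the program does internally: the function name `split_thirds`, the fixed `seed`, and the even split of the shuffled rows are all assumptions.

```python
import random

def split_thirds(rows, seed=0):
    """Hold out the last third (in time order) as the applied set, then
    shuffle the first two-thirds and divide them into training and
    validation sets. `rows` must already be sorted oldest to newest."""
    n = len(rows)
    cut = n - n // 3                      # index where the applied block starts
    early = list(rows[:cut])              # first two-thirds, copied before shuffling
    applied = list(rows[cut:])            # last third, left in time order
    random.Random(seed).shuffle(early)    # seeded so the split is repeatable
    half = len(early) // 2
    return early[:half], early[half:], applied

# Integers stand in for the roughly 3400 daily observations.
train, valid, applied = split_thirds(list(range(3400)))
print(len(train), len(valid), len(applied))
```

The applied rows are never shuffled in with the others, so every one of them is later in time than anything the model was trained or validated on.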
The output (forecast) is the future SPY percent change over 1 day, 3 days, 5 days, 10 days, etc.
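As a sketch of how such a target could be computed from a list of daily closes (the helper name `forward_pct_change` and the use of closing prices are assumptions; the post does not say which price is used):

```python
def forward_pct_change(closes, horizon):
    """Percent change from each day's close to the close `horizon` days later.
    The last `horizon` days have no future close yet, so they get None."""
    out = []
    for i in range(len(closes)):
        j = i + horizon
        if j < len(closes):
            out.append((closes[j] / closes[i] - 1.0) * 100.0)
        else:
            out.append(None)
    return out

closes = [100.0, 110.0, 121.0, 108.9]      # made-up prices, for illustration only
one_day = forward_pct_change(closes, 1)
three_day = forward_pct_change(closes, 3)
```

Rows whose target is None would have to be dropped before training, since there is nothing to forecast against.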
For input variables I've calculated several moving averages, channels, gaps, prior-day changes, the daily range, and some other indicators. They're just inputs for it to chew on to help project the output.
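Two of these inputs could be computed roughly as below. This is a sketch under assumptions: the exact definitions used in the post are not given, so `sma` (a trailing simple moving average) and `gap_pct` (today's open versus yesterday's close) are illustrative guesses. Both use only data available on the day in question, so the inputs do not peek at the future.

```python
def sma(values, window):
    """Trailing simple moving average; None until `window` values exist."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window:i + 1]) / window)
    return out

def gap_pct(opens, closes):
    """Percent gap between each day's open and the previous day's close."""
    out = [None]                           # the first day has no prior close
    for i in range(1, len(opens)):
        out.append((opens[i] / closes[i - 1] - 1.0) * 100.0)
    return out

# Made-up prices, for illustration only.
opens = [100.0, 102.0, 101.0, 104.0]
closes = [101.0, 100.0, 103.0, 105.0]
ma2 = sma(closes, 2)
gaps = gap_pct(opens, closes)
```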
I'm thinking the training and validation data are randomized to remove systematic biases, so that it's not training on just 93-98 and validating on 99-03, but I'm not really sure why the data needs to be broken up this way in the first place. In college I enjoyed my econometrics projects, but I never had to break the data down three ways like this.
Part of me wonders if I'm using too long a time period, and if I should include more recent data in the training and validation sets, but the manual actually recommends breaking the data into thirds as described above.
Thanks for the reply.
You should break up the data into two sets, one for training and one for validation. The model is trained on the first set and validated (tested) on the second. Compare the validation result with the training result to make sure the model still holds up on data it was not fit to.
Then apply the model to the future data, and again compare the result with the training result to confirm that the model is still applicable.
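One way to make that comparison concrete is to compute the same error measure on each dataset and look at them side by side. The sketch below assumes mean squared error as the measure and a model that is just a callable from one input row to one forecast; `error_report` and the toy model are hypothetical, not part of any particular package.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual values and forecasts."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def error_report(model, datasets):
    """Score `model` (a callable: input row -> forecast) on every dataset.
    `datasets` maps a name to an (inputs, targets) pair."""
    report = {}
    for name, (inputs, targets) in datasets.items():
        forecasts = [model(x) for x in inputs]
        report[name] = mean_squared_error(targets, forecasts)
    return report

# Toy stand-in for a trained model, with made-up numbers.
toy_model = lambda x: 2 * x
report = error_report(toy_model, {
    "training": ([1, 2], [2, 4]),
    "validation": ([3], [7]),
})
```

If the validation or applied error comes out much larger than the training error, that is a sign the model has fit the training data too closely and is not carrying over to data it has not seen.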