Modeling question - genetic programming

yip1997 · Sep 24, 2007

Quote from smili:

This is my first set of data I'm running through the program. Just seeing how it works and learning the options.

It's daily SPY data back through mid 1993, about 3400 observations in total. Broken into thirds for the training, validation, and applied datasets. (applied is the last third, and training and validation are the first two-thirds of data randomized). The applied data is not used in any way in forming the model, but is used only to test the model on data it's never seen.

The output (forecast) is either 1 day future SPY pct change, or 3 day future, 5 day future chg, or 10 day future chg, etc.

For input variables I've calculated several moving averages, channels, gaps, prior day changes, daily range, some other indicators. Just inputs to let it chew on to help project the output.

I'm thinking the randomizing of the training and validation data is to remove systematic biases in the data so that it's not training on just 93-98, validating on 99-03, but I'm not really sure why it needs the data broken up as above in the first place. In college I enjoyed my econometrics projects, but I didn't have to break the data down in 3 ways like this, so not sure.

Part of me wonders if I'm using too long a time period, and if I should include more recent data in the training and validation sets, but the manual actually recommends breaking the data into thirds as mentioned above.

thanks for the reply.

You should break up the data into 2 sets, one for training and one for validation. The model is trained using the first set, and the model is validated (or tested) using the second set. The result of the validation should be compared to the result of the training to make sure that the model is still valid.

Then apply the model to the future data. You always compare that result with the result on your training set to justify that the model is still applicable.
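For reference, the three-way split described in the quoted post (last third held out as "applied", first two-thirds shuffled into training and validation) could be sketched roughly as below. This is only an illustration, not the procedure the software itself uses: `split_rows`, the seed, and the even training/validation split are assumptions, and the integers stand in for the roughly 3400 daily observations.

```python
import random

def split_rows(rows, seed=0):
    """Split chronologically ordered rows (oldest first) three ways.

    Hypothetical helper: the last third is held out in time order,
    and the first two-thirds are shuffled and divided evenly.
    """
    n = len(rows)
    cut = (2 * n) // 3
    applied = rows[cut:]        # most recent third, never used for fitting
    earlier = rows[:cut]        # slicing copies, so the input list is untouched
    random.Random(seed).shuffle(earlier)
    half = len(earlier) // 2
    training, validation = earlier[:half], earlier[half:]
    return training, validation, applied

# Placeholder observations: index 0 is the oldest day, 3399 the newest.
rows = list(range(3400))
training, validation, applied = split_rows(rows)
print(len(training), len(validation), len(applied))
```

Because the cut is made before the shuffle, no applied-period observation can appear in the training or validation sets.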

yip1997 · Sep 24, 2007

Quote from jack hershey:

Thanks for your response.

I leafed through the manual and the binary part appealed to me. It will not take True but I could use 1's and 0's. Pages 169 and 170 looked a little redundant.

A model with binary output might not be a profitable model. Assume your model is correct (i.e. high accuracy in the output), so you know with good accuracy (say 90%) whether it is going up or down. The profit on the profitable days might still be small and the loss on the losing days might be huge.
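A quick arithmetic sketch of that point. The 90% hit rate comes from the post; the average gain and loss sizes are made-up numbers chosen only to show the effect, and `expected_daily_return` is a hypothetical helper.

```python
def expected_daily_return(hit_rate, avg_gain, avg_loss):
    """Expected return per trade for a direction-only (binary) forecast."""
    return hit_rate * avg_gain - (1.0 - hit_rate) * avg_loss

hit_rate = 0.90    # from the post: right about direction 90% of the time
avg_gain = 0.001   # assumed: +0.1% on a correct call
avg_loss = 0.015   # assumed: -1.5% on a wrong call

edge = expected_daily_return(hit_rate, avg_gain, avg_loss)

# Largest average loss the model could absorb and still break even.
breakeven_loss = hit_rate * avg_gain / (1.0 - hit_rate)

print(f"expected return per trade: {edge:.4%}")
print(f"break-even average loss:   {breakeven_loss:.4%}")
```

With these assumed sizes the expected return is negative (about -0.06% per trade) despite the 90% accuracy, because any average loss larger than nine times the average gain wipes out the edge.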

When you use any training software, the first step is to determine the right criterion. Even the most popular one, MSE (mean squared error), is not a good choice here.

Corey · Sep 24, 2007

The reason you randomize the 2/3rds of the input is to guarantee you are not training on a time series, but rather on individual points. It also helps ensure that the model is more robust. If you train in order, the model is often overtrained towards the present, and forgets the past (making it less robust, but often more accurate for short time periods).

Well, for Neural Networks at least. Genetic programming is a generic buzzword that basically means they are using some sort of data-system that parallels biological evolution, often optimizing over a solution set based on fitness functions.

So why would randomizing input be important for genetic programming? For pretty much the same reason as I said above -- to generate a more robust model. If you are constantly making comparisons between genes generated during the same time period, you won't form a good long-term comparison model. You are choosing long term robustness over short term (in)accuracy.

Kohanz · Sep 24, 2007

Just a question out of curiosity on this topic, for the experts.

My instinct tells me that if you have a time-series that you are doing testing on, if you randomize the samples in this series (I think the term "observations" was used earlier), assuming that the time-series is a single-realization of a random process (possibly my mistake?), aren't you irreparably altering the statistics of that time-series, therefore making it an invalid realization of the underlying random process?

I'm thinking of this from an engineering background, but basically, if you take a signal, and randomize the samples, you will change the frequency content/characteristics of the data - and the new data set will not be anything like the old one. Therefore, although the initial data-set is a good example of a financial time-series, I don't see how a randomized version of that same series could be seen as a valid representative financial time series.
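The concern can be demonstrated on a toy signal. The sketch below uses a synthetic AR(1) series (an assumption for illustration, not real SPY data) and compares its lag-1 autocorrelation before and after shuffling; `lag1_autocorr`, the 0.9 coefficient, and the seed are all hypothetical choices.

```python
import random

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

rng = random.Random(42)

# Synthetic AR(1) signal: each sample depends strongly on the previous one.
series = [0.0]
for _ in range(1999):
    series.append(0.9 * series[-1] + rng.gauss(0.0, 1.0))

# Same values, random order.
shuffled = series[:]
rng.shuffle(shuffled)

original_ac = lag1_autocorr(series)
shuffled_ac = lag1_autocorr(shuffled)
print(f"original: {original_ac:.3f}  shuffled: {shuffled_ac:.3f}")
```

The shuffled copy keeps the same values (so the same mean and variance) but loses the sample-to-sample dependence, which is exactly the change in frequency content described above.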

Any input would be appreciated, I'm just curious to understand this.

Corey · Sep 24, 2007

Your assumption would be correct ... if you WERE looking at it as a time series. Rather, most neural networks and genetic algorithms use single time snapshots whose datapoints take into account time series change. So instead of feeding in five separate data points in a row and getting a result, you feed in one data point that might comprise the rate of change over that period.

By doing it this way, you avoid overoptimizing to a certain time period. You randomize over a large time period, but have each input use variables which describe the time period it is in.
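A minimal sketch of the "snapshot" idea: each row carries its own time context (here a 5-day and a 1-day rate of change) together with the future change to be predicted, so the rows can be shuffled without breaking the link between inputs and output. The function name `make_snapshots`, the two features, and the price list are invented for illustration.

```python
import random

def make_snapshots(closes, lookback=5, horizon=1):
    """Turn a price series into self-contained (features, target) rows."""
    rows = []
    for t in range(lookback, len(closes) - horizon):
        past_change = closes[t] / closes[t - lookback] - 1.0   # change over lookback days
        prior_day = closes[t] / closes[t - 1] - 1.0            # previous day's change
        future_change = closes[t + horizon] / closes[t] - 1.0  # value to forecast
        rows.append(((past_change, prior_day), future_change))
    return rows

# Made-up closing prices, oldest first.
closes = [100.0, 101.0, 102.0, 101.5, 103.0, 104.0, 103.5, 105.0, 106.0, 105.5]

rows = make_snapshots(closes)

# Shuffling reorders whole rows; each row's inputs and target stay paired.
shuffled_rows = rows[:]
random.Random(0).shuffle(shuffled_rows)

for features, target in shuffled_rows:
    print(features, target)
```

The time structure is computed before any shuffling, while the series is still in order; only the finished rows are randomized.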

But take my knowledge as opinion -- I am certainly no 'expert.'
