I realize this is an old thread, but it's an interesting topic that deserves to spend some time at the front! As someone who is currently taking classes in ML at an undergraduate and graduate level, I think I can offer you a couple of little pieces of advice that will save you a lot of headaches. The way you've proposed this project seems a little off base to me. When I was first taught about evolutionary learning, it seemed to me like 'God's truth.' The idea that you can solve an optimization problem by emulating nature is something beautiful, and even more beautiful is the fact that your solutions suck if you fail to include enough diversity in each population...truly amazing. From an end user standpoint, however, there's no difference whether an optimization problem was solved using a brute force sweep, GA, or some gradient-based algorithm when applicable (UNLESS YOU MISS THE MAXIMUM!!!). Now, what you're proposing to do (I believe) is to use evolutionary learning to generate entire algorithms and not just find the max/min of a defined function. This is something I know a little about (not a lot), but I think this lecture from MIT might be helpful.
It gives a lot of info about a cool relevant problem. "Evolving Virtual Creatures" by Karl Sims:
In my brief experience with ML, here's some advice:
Regarding Coding:
1. Starting out in C in a Linux environment is a BAD idea, as others have pointed out. Here's why:
As you've already heard on this thread:
a. It'll take you an eon. Most of what you're doing will be prototyping, so don't do that in this environment.
As I'll point out:
b. SINGLE BIGGEST REASON: Good genetic algorithm results aren't the result of just letting a computer run wild; that will never work...computers are good at computing, humans are good at being creative and reasoning. You will need to input new features to permute all the time as parts of your genetic algorithm. You will constantly be coding up new functions to add feature vectors to your dataset...this will be slowed down enormously by using C in Linux.
c. You may need something like OpenCL or CUDA to achieve good performance, so C is good there, but C++ or a language with a wrapper for those is going to make your life much easier.
2. Use other people's code where you can. This is going to be a massive project. You could use libraries to achieve a lot of this, maybe C has some, but look at other options too.
3. If I were you, I would stick to MATLAB, R, Python, or C# for now. I can speak for MATLAB and R having tons of resources available for this kind of work. I believe Python does too, but I've never used it for ML.
...enough of the usual ET nonsense where we get off topic, bicker, and argue about tools. Let's talk a bit about my uninitiated intuitions on implementing GA on this sort of problem in general.
Regarding Implementation:
1. GA is an optimization technique. It's not going to just print a little sheet of good trading strategies or something. It requires that you understand what's going on, how your GA implementation works, and what your input vectors are, and that you give your GA good data and objectives.
2. For the love of God, please please do not make your fitness function net profit. That's a recipe for disaster. You NEED to have some measure of risk included in your fitness function. Use the Sharpe ratio, or better, some metric that includes CVaR (conditional value at risk), but do not use raw profit as your objective.
3. By nature of GA, you're going to overfit. There's no avoiding that. You need to understand how the resulting algorithms work and hopefully be able to describe a realistic economic reason why they work. If you're making money, you're doing a service to the market to make it more efficient in some perverse way or another, so unless you can say conclusively what that service is, I'd be wary of deploying any such algorithm. Here's a great thread by a really smart guy on that topic:
http://www.elitetrader.com/et/index.php?threads/why-strategies-make-money.287837/
4. You're gonna need a lot of really clean data.
5. If I were you, I would invert the whole process you're proposing (but I doubt you will since that's not your objective). I would first come up with a strategy targeting an 'economic opportunity' ^^^ and then use a genetic algorithm to get you closer to a tradable strategy exploiting this opportunity iteratively.
6. If you're really going to use something like GA, then embrace the curve fitting (sometimes). You're trying to find permutations of an algo that make money, so let the computer try permutations. Make all your input functions take variable arguments, and have one argument be a flag to permute or not to permute X variable or feature of the function, then make another input a vector of scalar parameters for that piece of code which are decided by your optimization routine...better yet, you could probably define which parts to permute recursively too...it should be organized in some way, but I'm not smart enough to tell you what that way is. Make the timeframe you trade on variable, the discrete time intervals features are calculated on variable, the assets you trade, everything variable. I don't really know, there's probably some fancy computer science theory that ensures that literally everything possible is made variable in an orderly fashion. If anyone knows what that is, please point us in the right direction. You will also want a way for a human being to control externally which inputs are permuted and which are not, so that you as the user are actively involved in the optimization process and can use your human judgment, instead of just letting the computer fit anything it wants all at the same time.
7. You'll probably want to record everything. Maybe come up with some way to database results automatically at each iteration with each parameter set. That way you have a record of what works, what doesn't, where your GA tends to get stuck, etc.
8. This is a big one. You're gonna come up with terrible ideas probably 95% of the time using this approach, and the more you allow the computer to replace human creativity, the more likely you are to come up with overfit nonsense. Again, you need to come up with good features to include in your dataset. Look at Karl Sims's video: it took a highly iterative process by a renowned CS hero to create even those simple GA results. The successful implementation of GA is way more a function of good human input than anything else. GAs themselves come in an open-source can that you can get on the internet and tweak. Good input is your responsibility.
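To make point 2 a bit more concrete, here's a rough Python sketch of a fitness function that scores a strategy on risk-adjusted terms instead of net profit. Everything in it is my own illustration and not from any library: the function names, the zero risk-free rate, the 252 trading periods per year, the 95% tail level, and especially the way the Sharpe ratio and the CVaR penalty are combined are all assumptions you would want to tune or replace.

```python
import math
import statistics


def sharpe_ratio(returns, periods_per_year=252):
    """Annualized Sharpe ratio of per-period returns.

    Assumes a risk-free rate of zero and 252 periods per year.
    """
    if len(returns) < 2:
        return 0.0
    stdev = statistics.stdev(returns)
    if stdev == 0:
        return 0.0
    return statistics.mean(returns) / stdev * math.sqrt(periods_per_year)


def cvar(returns, alpha=0.95):
    """Historical CVaR: average loss over the worst (1 - alpha) share of periods.

    Positive when the tail loses money, negative when even the tail is profitable.
    """
    if not returns:
        return 0.0
    ordered = sorted(returns)  # worst returns first
    tail_size = max(1, int(round(len(ordered) * (1 - alpha))))
    return -statistics.mean(ordered[:tail_size])


def fitness(returns, cvar_penalty=1.0):
    """Hypothetical objective: Sharpe ratio minus a penalty on tail loss.

    The weighting between the two terms is an arbitrary assumption.
    """
    return sharpe_ratio(returns) - cvar_penalty * max(0.0, cvar(returns))
```

The point of the sketch is just that two strategies with the same total profit can get very different scores: a steady series of small gains should beat a lumpy series that reaches the same total through big wins and big losses.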
I've never implemented a GA for anything really complex, and I've never done one that designs a new algorithm, only ones that look for optimal parameters over a defined parameter space. The goal of this type of implementation is to conserve iterations of an optimization algorithm, not to come up with a trading algorithm like you suggest. There are probably people here way more knowledgeable on the topic. I suggest you look on Quora, Google Groups, LinkedIn, etc. to find people really doing this type of thing. Best of luck.
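For what it's worth, the simple kind of GA I mean (searching for good parameters over a defined parameter space) can be sketched in a few dozen lines of Python. This is only a toy under my own assumptions, not a recommended design: `run_ga`, its arguments, the uniform crossover, the Gaussian mutation step, and the elitism are all choices I made for illustration. The `frozen` argument is one hypothetical way to let a human pin parameters so the GA never permutes them (point 6), and `history` keeps a record of every parameter set and score (point 7).

```python
import random


def run_ga(fitness, bounds, frozen=None, pop_size=30, generations=40,
           mutation_rate=0.2, seed=0):
    """Maximize fitness(params) over box bounds with a very small GA.

    bounds: list of (low, high) pairs, one per parameter.
    frozen: dict {index: value} of parameters the user pins by hand.
    Returns (best_params, best_score, history).
    """
    rng = random.Random(seed)
    frozen = frozen or {}

    def pin(ind):
        # Overwrite user-controlled genes so the GA never permutes them.
        for i, value in frozen.items():
            ind[i] = value
        return ind

    def random_individual():
        return pin([rng.uniform(lo, hi) for lo, hi in bounds])

    def mutate(ind):
        child = list(ind)
        for i, (lo, hi) in enumerate(bounds):
            if rng.random() < mutation_rate:
                step = rng.gauss(0, 0.1 * (hi - lo))
                child[i] = min(hi, max(lo, child[i] + step))
        return pin(child)

    def crossover(a, b):
        # Uniform crossover: each gene comes from either parent.
        return pin([x if rng.random() < 0.5 else y for x, y in zip(a, b)])

    population = [random_individual() for _ in range(pop_size)]
    history = []  # one record per evaluated parameter set
    best, best_score = None, float("-inf")

    for gen in range(generations):
        scored = [(fitness(ind), ind) for ind in population]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        for score, ind in scored:
            history.append({"generation": gen, "params": tuple(ind),
                            "score": score})
        if scored[0][0] > best_score:
            best_score, best = scored[0][0], list(scored[0][1])
        parents = [ind for _, ind in scored[: pop_size // 2]]
        # Elitism keeps the current best; the rest are mutated crossovers.
        population = [list(scored[0][1])]
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            population.append(mutate(crossover(a, b)))

    return best, best_score, history


def toy_objective(params):
    """Made-up smooth objective with its maximum at (3, -1)."""
    return -((params[0] - 3) ** 2) - (params[1] + 1) ** 2
```

You would call it with something like `run_ga(toy_objective, [(-10, 10), (-10, 10)], frozen={1: -1.0})` to search only the first parameter while holding the second fixed. In a real project the objective would be a backtest scored with a risk-adjusted metric, and `history` would go into a database rather than a list.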