Designing a Machine Learning model for forex prediction

destriero · Dec 26, 2023

TrAndy2022 said:
You should use multiple data streams so you can build intermarket strategies. Only when there is no spurious correlation among instruments can you get true out-of-sample profits. But do not expect too much here; it is not easy. If you only look at one symbol by itself, you will not find anything meaningful, because there is no stable autocorrelation there.

Multiple feeds when everyone has been employing triangular arbitrage for decades and are all using EBS as a reference feed, you tw*t.

destriero · Dec 26, 2023

mmutoo said:
Hi all. I am fairly new to this area and just joined the forum. I hope I can learn a lot from the people in this forum and also try to share my knowledge with newbies like myself.

That being said, I have gone through a couple of finance and trading courses (autotrading) and am working on my own code for that. I was hoping people here could help me with my problem. I am using an MLP (4 layers, 50 neurons each) to make predictions on forex data. I have been working on the model for a long time now with no luck. I totally understand that prediction accuracy in forex is fairly low, and I am not expecting much. But I am within about 1% of random, which I think could be better.

I could provide all the details, but it would be a pretty long post, so I will mention the essential ones and ask my question. I am predicting EUR_USD on a 20min period. I work on Oanda and use 180 days of data to train my model. I download 5min data and resample it to 20min. Then I use some features (my main question will be about these) and their lagged versions up to 5 lags (5 × 20 min = 100 min), but the features include, for example, moving averages with a window of 50, so I am not limited to that lookback.

I train my MLP without any regularization and am able to overfit. But no matter what kind of regularizer I add, the end result is that my model cannot generalize. More precisely, the training and validation loss both decrease and the training accuracy increases, but the validation accuracy either stays at about the same value (about 50%) or goes down. I have a comprehensive set of features, but right now I am using MA, EMA, real price and volume as features for testing (any feature I add actually makes everything worse).

Now my question is: is an MLP a good model for prediction in this case? What kind of regularization works better? And finally, what features do you suggest I use?

Go with hourly or preferably 4H.

maciejz · Feb 21, 2024

Your model is WAY too complex. 4 layers with 50 neurons each is over 10K parameters. How large is your dataset? You’d want to have thousands of data points for each parameter due to large noise levels. So, unless you have over 10 million rows of data, your model will produce pretty random results out of sample.
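As a rough check on that parameter count, here is a minimal sketch that counts weights and biases for a fully connected network with four hidden layers of 50 units. The input widths (24 and 60) and the single output unit are assumptions, since the thread does not state them; the total lands a little under or a little over 10K depending on how many lagged features go in.

```python
def mlp_param_count(n_inputs, hidden=(50, 50, 50, 50), n_outputs=1):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden, n_outputs]
    # Each layer holds (fan_in + 1) * fan_out parameters:
    # one weight per input plus one bias, for every unit in the layer.
    return sum((fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:]))


# Hypothetical input widths, e.g. 4 features x 6 time steps, or a wider feature set.
for n_inputs in (24, 60):
    print(n_inputs, mlp_param_count(n_inputs))
# 24 8951
# 60 10751
```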

Making the model too complex is a pretty common rookie mistake in ML. The temptation is to create the “greatest” model ever, which turns into the most complex model. But there is something called the bias/variance trade-off, and it is one of the fundamentals of ML. Quite frankly, getting the model complexity right is quite challenging and is one of the reasons why ML is sometimes referred to as an art.

There are lots of books to help get you started with ML, but one of the best ones focused on these foundational concepts is “Learning from Data: A Short Course” by Yaser Abu-Mostafa et al.

Applying ML to trading is non-trivial. If it were easy, everyone would be doing it.

One of the main things it requires is knowing ML very well. I don’t mean knowing how to use the available libraries; that’s not knowing ML.

Even knowing enough ML to implement the algorithms (NN or gradient boosted trees) yourself is not enough, although it is a necessary step. You really need to know what is going on underneath the covers, because what’s going on there is not magic.

If you don’t, then you’re just throwing shit against the wall. And, there are plenty of people who do that, but their results out of sample look nothing like what they expect.

My other piece of advice would be to forget neural networks and deep learning. Yes, they are very powerful and all that, but they are not the most efficient approach to the trading problem, which is a tabular data problem. Gradient boosted trees are pretty much the state of the art for tabular data problems. This doesn’t mean that you can throw XGBoost or LightGBM at your data and have a tradeable model. It’s just that most people have an easier time understanding what’s going on in a tree model than in a neural network; also, gradient boosted trees should be way faster to train than NNs. And if you understand what’s happening underneath the covers, then you’ll understand that both approaches should produce very similar models if you have things tuned properly.

Anyhow, good luck on your journey, it’s a fascinating one.

mmutoo · Feb 22, 2024

@maciejz Thank you for the detailed reply. I am actually not completely new to the ML field. I have read the book you mentioned. Regarding the number of samples, I have about 500K, and I monitored the learning process to make sure the model is not overfitting. I also tested XGBoost and Random Forest which gave me a similar result. I even tried an ensemble method. Anyways, I will look into what you mentioned and dive deeper.

mmutoo · Feb 22, 2024

destriero said:
Go with hourly or preferably 4H.

Thanks. Will try that one too.

maciejz · Feb 22, 2024

How deep were your trees in XGBoost? Also, I know that some implementations don’t respect a minimum split size, which is going to produce more random results.

mmutoo · Feb 26, 2024

I don't remember that since it was long ago. I will check. Is there a rough number for that amount of data?

maciejz · Feb 26, 2024

In terms of minimum size, keep in mind what models like XGBoost are doing underneath. On each iteration, they are building a decision tree on your data. So, if you are using a max depth of 1 (for illustration here), which means one split on a single feature (pretty much the simplest model possible) the algo makes a split point and calculates the y_hat on the left and right side of the split point by taking a mean of the y values in each of those regions. The "optimal" split point is the one that will minimize your error, probably MSE (it's a good idea to use MSE even for trading systems).
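To make the depth-1 case concrete, here is a minimal sketch of a single-feature regression stump in plain Python. It is not how XGBoost is implemented internally (boosting fits trees to gradients over many iterations); it only illustrates the split search described above. The function name `fit_stump` and the `min_leaf` parameter are hypothetical names chosen for the example.

```python
def fit_stump(x, y, min_leaf=1):
    """Find the split on one feature that minimizes the squared error.

    Returns (sse, threshold, left_mean, right_mean), or None when no split
    leaves at least min_leaf points on both sides.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    xs = [x[i] for i in order]
    ys = [y[i] for i in order]
    n = len(xs)
    total, total_sq = sum(ys), sum(v * v for v in ys)

    best = None
    left_sum = left_sq = 0.0
    for i in range(1, n):
        # Move one more point to the left side of the candidate split.
        left_sum += ys[i - 1]
        left_sq += ys[i - 1] ** 2
        if xs[i] == xs[i - 1]:
            continue  # cannot split between identical feature values
        n_left, n_right = i, n - i
        if n_left < min_leaf or n_right < min_leaf:
            continue  # enforce the minimum split size on both sides
        right_sum, right_sq = total - left_sum, total_sq - left_sq
        # Sum of squared errors around each side's mean (the y_hat of that side).
        sse = (left_sq - left_sum ** 2 / n_left) + (right_sq - right_sum ** 2 / n_right)
        if best is None or sse < best[0]:
            threshold = (xs[i - 1] + xs[i]) / 2
            best = (sse, threshold, left_sum / n_left, right_sum / n_right)
    return best


print(fit_stump([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1], min_leaf=3))
# (0.0, 3.5, 0.0, 1.0)
```

Raising `min_leaf` simply rules out candidate splits; with `min_leaf=4` on six points there is no valid split and the function returns `None`.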

Now, the reason all of the above is related to your question about the minimum split size, is because y_hat for a split is the mean of the y points in that split. That is literally the definition of a sample mean. And we know that the standard error (SE) of the mean (so, the SE of y_hat) is inversely proportional to the square root of the sample size. Well, the sample size is your split size -- it is the number of data points on the left (or right) side of your split point. So, if you increase your minimum split size by a factor of 4, you decrease the standard error of your y_hat by a factor of 2.
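The square-root relationship is easy to check numerically. The sketch below is only an illustration under simple assumptions (independent Gaussian noise with a made-up standard deviation `sigma`); real returns are not i.i.d., so treat it as a sanity check rather than a model of the market. The function names are hypothetical.

```python
import math
import random
import statistics


def standard_error(sigma, n):
    """Analytic standard error of the mean of n i.i.d. points with std sigma."""
    return sigma / math.sqrt(n)


def empirical_se(n, trials=2000, sigma=1.0, seed=0):
    """Estimate the same quantity by simulating many sample means of size n."""
    rng = random.Random(seed)
    means = [sum(rng.gauss(0.0, sigma) for _ in range(n)) / n for _ in range(trials)]
    return statistics.stdev(means)


# Quadrupling the split size from 1K to 4K halves the standard error of y_hat.
print(standard_error(1.0, 1_000) / standard_error(1.0, 4_000))
```

Calling `empirical_se(50)` and `empirical_se(200)` should give a ratio close to 2 as well, up to simulation noise.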

Reducing the SE of your y_hat is going to reduce the variance of your model, and typically you want that. Optimizing the bias/variance trade-off can't be done in-sample. So, to truly determine the optimal model complexity, including the max depth and minimum split size, requires cross-validation. You'd have to try different hyper-parameter configurations and see which performs best on validation. Keep in mind that this burns your validation data and thus you can't use your validation performance as an estimate of future out-of-sample performance. You'd have to have yet another clean data set that you'd use to estimate your expected OOS performance.
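The three-way use of data described above can be sketched as follows. Splitting in time order without shuffling is an assumption on my part (it is the usual choice for time series, but the post does not specify it), and `chronological_split`, `select_config` and the 60/20/20 proportions are hypothetical names and values for illustration.

```python
def chronological_split(n_rows, val_frac=0.2, test_frac=0.2):
    """Return (train, validation, test) index ranges in time order, no shuffling."""
    n_val = int(n_rows * val_frac)
    n_test = int(n_rows * test_frac)
    n_train = n_rows - n_val - n_test
    return (
        range(0, n_train),
        range(n_train, n_train + n_val),
        range(n_train + n_val, n_rows),
    )


def select_config(configs, validation_error):
    """Pick the hyper-parameter configuration with the lowest validation error.

    validation_error is any callable that trains on the training range with
    the given config and returns its error on the validation range.
    """
    return min(configs, key=validation_error)


train_idx, val_idx, test_idx = chronological_split(500_000)
print(len(train_idx), len(val_idx), len(test_idx))
# 300000 100000 100000
```

The test range is touched once, after `select_config` has made its choice, so the number it produces has not been contaminated by the hyper-parameter search.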

From my experience with daily frequency models, I'd say that you want at the very very least 1K data points in each split; but the more the better, typically. This will also affect your model complexity. If you have 500K data points, and you want a min of 1K in each split and have about 10 potential splits per dimension, you could get to a depth 6 model. But, that's way overkill IMHO. I'd stick to a depth 3 or lower, and increase min split size to 10K.
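For a feel of how the minimum split size caps depth, here is a back-of-the-envelope sketch. It assumes every split cuts the data exactly in half, which real splits on noisy data will not do, so the result is only an upper bound and the depth reachable in practice is lower. The function name is hypothetical.

```python
import math


def max_balanced_depth(n_rows, min_leaf):
    """Largest depth at which each leaf of a perfectly balanced tree
    still holds at least min_leaf rows (2**depth leaves in total)."""
    return math.floor(math.log2(n_rows / min_leaf))


print(max_balanced_depth(500_000, 1_000))   # 8
print(max_balanced_depth(500_000, 10_000))  # 5
```

Even under this generous assumption, a 10K minimum leaves little room beyond a handful of levels, which fits with keeping the depth at 3 or lower.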

One more thing, XGBoost, at least when I tried it last, did not respect the minimum split size configuration. This is one of the reasons why XGBoost will often not provide very good results on data with low signal to noise, like pretty much all financial data.

Hope that helps. Let me know what you find.

blueraincap · Mar 5, 2024

Financial Machine Learner