In past posts, I have mainly talked about automated trading strategies based on simple, rule-based logic driven by technical analysis. In this post I want to share how we can use machine learning algorithms, particularly those suited to classification problems, to predict the next day’s market direction. Yes, I mean only the next day’s price direction, not the next month or the next 6 months. The reason I am focusing on such a short time horizon is that it should, in theory, be easier to predict the short term than the medium or long term. Let’s jump into it.
I will use light crude oil futures data, loaded from a CSV file. The data contains the daily open, high, low and close prices of crude oil futures (CL on NYMEX). I then convert the data frame into an xts object (a time series format) so that I can plot it with a candlestick plotting function from the quantmod package. I also compute exponential moving averages (EMAs) of the close over 7, 20, 50 and 70 periods.
library(tidyverse)
library(lubridate)
library(quantmod)
library(ggplot2)

# Load the daily OHLC data
price <- read.csv("wtiDaily.csv")

# Small function to format the data into xts format
setup_data <- function(pricedata) {
  names(pricedata) <- c("Date", "Open", "High", "Low", "Close")
  dates <- parse_date_time(x = pricedata[, 1], "mdy_HM", truncated = 3)
  pricedata <- pricedata[, 2:5]
  # Return the xts object explicitly
  xts(pricedata, order.by = dates)
}

price <- setup_data(price)

# Exponential moving averages of the close
ema7 <- EMA(price$Close, n = 7)
ema20 <- EMA(price$Close, n = 20)
ema50 <- EMA(price$Close, n = 50)
ema70 <- EMA(price$Close, n = 70)
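To make the EMA step less of a black box, here is a minimal base R sketch of the recursion behind an exponential moving average. The helper `ema_sketch` and the closing prices are made up for illustration. The seeding rule (simple mean of the first n values) and the smoothing factor 2 / (n + 1) are assumptions about common defaults, so check the TTR documentation before relying on an exact match with `EMA()`.

```r
# Hypothetical helper (not part of TTR): a hand-rolled EMA for illustration.
# Assumptions: seed with the simple mean of the first n values and smooth
# with alpha = 2 / (n + 1).
ema_sketch <- function(x, n) {
  stopifnot(length(x) >= n)
  alpha <- 2 / (n + 1)
  out <- rep(NA_real_, length(x))   # the first n - 1 values stay NA
  out[n] <- mean(x[1:n])            # seed value
  if (length(x) > n) {
    for (i in (n + 1):length(x)) {
      out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
    }
  }
  out
}

closes <- c(50, 51, 52, 51, 53, 54)  # made-up closing prices
ema3 <- ema_sketch(closes, n = 3)
print(ema3)
```

The leading NA values are the reason the first rows of the feature table have to be dropped later: a 7-period average has nothing to report for the first 6 days.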
After formatting the data, I use the chartSeries function to plot it, together with the 7- and 20-period EMAs.
chartSeries(price, TA = NULL, subset = '2017-06::')
addEMA(n = 7, col = "orange")
addEMA(n = 20, col = "red")
Now that I have the chart, I need to explain how I plan to use machine learning to predict the next day’s price direction. Another way of saying this is predicting the next day’s candle type. A daily bull (green) candle means the price closed higher than it opened that day, and a bear (red) candle means it closed lower, so I will be predicting whether the next day’s candle will be a bull or a bear candle. Looking at the chart above, my hypothesis is that if I use a few variables as predictors (features) in my machine learning algorithm, I should be able to predict the next day’s candle fairly well and therefore get a rising equity curve. As features, I will use the following:
- close price relative to the exponential moving average of the last 7 periods (above or below)
- today’s candle type (bull or bear)
- previous day’s candle type (bull or bear)
- daily return (here, the absolute difference between the close and the open)
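# Candle type for each day: "bull" if the close is above the open, else "bear"
candle.type.current <- data.frame(ifelse(price$Close > price$Open, "bull", "bear"))
# Previous and next day's candle type. dplyr's lag/lead are called explicitly
# so the result does not depend on package load order.
candle.type.previous <- data.frame(dplyr::lag(candle.type.current$Close, n = 1))
candle.next.day <- data.frame(dplyr::lead(candle.type.current$Close, n = 1))
# Position of the close relative to the 7-period EMA
position.to.ema7 <- data.frame(ifelse(price$Close > ema7, "above", "below"))
# Absolute open-to-close move of the day, and the same value for the next day
dailywin <- data.frame(abs(price$Close - price$Open))
candle.nextday.win <- dplyr::lead(dailywin$Close, 1)

# Making up the dataframe with all the features as columns
dailyprice <- data.frame(candle.type.current, candle.type.previous,
                         position.to.ema7, dailywin,
                         candle.nextday.win, candle.next.day)
# Naming the dataframe columns
names(dailyprice) <- c("candle.type.current", "candle.type.previous",
                       "position.to.ema7", "dailywin",
                       "candle.nextday.win", "candle.next.day")
# Filtering out the leading rows with NAs (the 7-period EMA is NA for the first 6 rows).
# The very last row still has NA next-day values, because there is no following day.
dailyprice <- slice(dailyprice, 7:nrow(dailyprice))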
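The lag and lead steps are what line up each day’s features with the next day’s outcome, so they are worth checking on a tiny example. The sketch below uses made-up candles and two hypothetical helpers, `shift_back` and `shift_forward`, which mimic a one-step lag and lead in base R. It is an illustration of the alignment, not the code used for the results in this post.

```r
# Hypothetical helpers that mimic a one-step lag and lead on a plain vector.
shift_back <- function(x) c(NA, head(x, -1))     # value from the previous row
shift_forward <- function(x) c(tail(x, -1), NA)  # value from the next row

candle <- c("bull", "bear", "bear", "bull", "bull")  # made-up candles
aligned <- data.frame(
  candle.type.current  = candle,
  candle.type.previous = shift_back(candle),
  candle.next.day      = shift_forward(candle),
  stringsAsFactors = FALSE
)
print(aligned)

# Only rows with both a previous and a next candle are usable for training.
usable <- aligned[complete.cases(aligned), ]
print(nrow(usable))
```

Each row now holds information known at the close of that day plus the label for the following day. The first row has no previous candle and the last row has no next candle, which is why both ends of the table need attention when filtering NAs.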
After defining the features, I divide the data into a training set and a test set. The data contains about 500 observations (trading days from June 2017 to August 2018). I use 300 data points as the training set and the rest as the test set. I also create a “formula” that specifies my target variable (what I am trying to predict) and the features. Finally, I define a small helper that turns predictions into daily and cumulative returns: a correct call earns the next day’s open-to-close move, and a wrong call loses it.
# Splitting the data into training and testing
# Note: row 300 falls in both ranges as written; start testRange at 301
# for a strictly disjoint split.
testRange <- 300:500
trainRange <- 1:300
test <- dailyprice[testRange, ]
train <- dailyprice[trainRange, ]

# Defining the formula: target variable and predictors
target <- "candle.next.day"
predictors.variable <- c("candle.type.current", "candle.type.previous",
                         "position.to.ema7", "dailywin")
predictors <- paste(predictors.variable, collapse = "+")
formula <- as.formula(paste(target, "~", predictors, sep = ""))

# Function for processing predictions: a correct call earns the next day's
# open-to-close move, a wrong call loses it
predictedReturn <- function(df, pred) {
  df$pred <- pred
  df$prediReturn <- ifelse(df$candle.next.day != df$pred,
                           -df$candle.nextday.win, df$candle.nextday.win)
  df$cumReturn <- cumsum(df$prediReturn)
  return(df)
}
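To see what `predictedReturn` does with a set of predictions, here is a small worked example on made-up numbers. The function body is repeated from the post so the block runs on its own; the `toy` data frame and `toy.pred` vector are invented for illustration. The return logic implicitly assumes a trade entered at the open and closed at the close, with no costs or slippage.

```r
# Same logic as predictedReturn() in the post, repeated so the block runs alone.
predictedReturn <- function(df, pred) {
  df$pred <- pred
  df$prediReturn <- ifelse(df$candle.next.day != df$pred,
                           -df$candle.nextday.win, df$candle.nextday.win)
  df$cumReturn <- cumsum(df$prediReturn)
  df
}

# Made-up next-day outcomes and open-to-close moves.
toy <- data.frame(
  candle.next.day    = c("bull", "bear", "bull", "bear"),
  candle.nextday.win = c(0.50, 0.30, 0.20, 0.40),
  stringsAsFactors = FALSE
)
toy.pred <- c("bull", "bull", "bull", "bear")  # made-up predictions

toy.result <- predictedReturn(toy, toy.pred)
print(toy.result[, c("pred", "prediReturn", "cumReturn")])
```

Only the second prediction is wrong, so that day’s move of 0.30 is subtracted while the other three moves are added, and `cumReturn` traces the resulting equity curve.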
Now let’s go ahead and train our machine learning model. Here I train a naive Bayes classifier (learn more about Naive Bayes).
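Before fitting the real model, it may help to see the idea behind naive Bayes worked by hand. The sketch below is a base R illustration on made-up rows, with a hypothetical helper `nb_posterior`. It covers categorical features only and applies no Laplace smoothing, so it is a simplification of what the naivebayes package does (numeric predictors such as `dailywin` need a distribution model as well).

```r
# Made-up training rows: two categorical features and the next-day label.
toy.train <- data.frame(
  candle.type.current = c("bull", "bull", "bear", "bear", "bull", "bear"),
  position.to.ema7    = c("above", "above", "below", "above", "below", "above"),
  candle.next.day     = c("bull", "bull", "bear", "bull", "bear", "bear"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: class posterior for one observation under the naive
# (conditional independence) assumption, without smoothing.
nb_posterior <- function(train, target, obs) {
  classes <- sort(unique(train[[target]]))
  scores <- sapply(classes, function(cl) {
    rows <- train[train[[target]] == cl, , drop = FALSE]
    prior <- nrow(rows) / nrow(train)
    # Product of per-feature conditional frequencies within this class
    likelihood <- prod(sapply(names(obs), function(f) mean(rows[[f]] == obs[[f]])))
    prior * likelihood
  })
  scores / sum(scores)  # normalise so the class scores sum to 1
}

obs <- list(candle.type.current = "bull", position.to.ema7 = "above")
posterior <- nb_posterior(toy.train, "candle.next.day", obs)
print(posterior)
```

The predicted class is simply the one with the largest posterior. In this toy table, a bull candle closing above the EMA was followed by a bull candle far more often than by a bear candle, so the posterior leans heavily towards "bull".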
library(naivebayes)

# Naive Bayes model
nb <- naive_bayes(formula, data = train)
plot(nb)

# Prediction
nb.pred <- predict(nb, test)
nb.test <- predictedReturn(test, nb.pred)

# Plotting the net daily returns and cumulative returns
plot(nb.test$prediReturn, type = "l")
plot(nb.test$cumReturn, type = "l")

# Confusion matrix (rows: actual, columns: predicted)
confusionMatrix.nb <- table(nb.test$candle.next.day, nb.test$pred)
print(confusionMatrix.nb)

# Calculating accuracy
nb.misclserror <- mean(nb.test$candle.next.day != nb.test$pred)
print(paste("Accuracy", 1 - nb.misclserror))
So the model’s performance on the test set is not disappointing. It returns an accuracy of 57%, which might seem low but is not a bad result in the trading arena for such a limited amount of data. Obviously, more validation should be performed by changing the training and test windows, carrying out more serious cross-validation and ideally using more data. Nevertheless, the equity curve looks interesting: it is going higher (moneyyyyyy). We should not underestimate the magnitude of the drawdowns, but I can say it looks good.
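One way to change the training and test windows systematically is a rolling (walk-forward) scheme. The sketch below only generates the row indices for such splits. The helper `walk_forward_splits` and the window sizes are hypothetical choices for illustration, not settings used for the results in this post.

```r
# Hypothetical helper: index ranges for rolling (walk-forward) evaluation.
# Each split trains on train_size rows and tests on the test_size rows
# that immediately follow, then the window moves forward by test_size.
walk_forward_splits <- function(n_obs, train_size, test_size) {
  starts <- seq(1, n_obs - train_size - test_size + 1, by = test_size)
  lapply(starts, function(s) {
    list(
      train = s:(s + train_size - 1),
      test  = (s + train_size):(s + train_size + test_size - 1)
    )
  })
}

splits <- walk_forward_splits(n_obs = 500, train_size = 300, test_size = 50)
for (sp in splits) {
  cat("train", min(sp$train), "-", max(sp$train),
      "| test", min(sp$test), "-", max(sp$test), "\n")
}
```

For each split, the model would be refitted on `dailyprice[sp$train, ]` and scored on `dailyprice[sp$test, ]`. Because every test window sits strictly after its training window, no future information leaks into the fit.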
Confusion matrix on the test set (rows: actual, columns: predicted):
     bear bull
bear   24   59
bull   28   90
[1] "Accuracy 0.567
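The printed confusion matrix holds more than the headline accuracy. As a rough sketch, the block below re-enters the four counts by hand and derives accuracy, per-class recall and precision, and the share of bull days in the test set. That share is the accuracy a naive "always bull" guess would reach, which makes it a useful reference point. The variable names are illustrative.

```r
# Confusion matrix as printed in the post (rows: actual, columns: predicted).
cm <- matrix(c(24, 59,
               28, 90),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("bear", "bull"),
                             predicted = c("bear", "bull")))

accuracy <- sum(diag(cm)) / sum(cm)
recall <- diag(cm) / rowSums(cm)           # share of each actual class that was caught
precision <- diag(cm) / colSums(cm)        # share of each predicted class that was right
bull.share <- sum(cm["bull", ]) / sum(cm)  # accuracy of always predicting "bull"

print(round(c(accuracy = accuracy, bull.share = bull.share), 3))
print(round(rbind(recall, precision), 3))
```

On these counts the accuracy comes out at about 0.567, matching the printed value, while the share of bull days is about 0.587. Comparing the model against that simple baseline is a worthwhile sanity check alongside the equity curve.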
I explain in more detail how to add more features and how to test other classification-based machine learning algorithms to predict the next day’s price in my course on Udemy. I felt I needed to create a course to share this knowledge and help others create their own trading strategies using machine learning with R.