This example shows how the k-nearest neighbor algorithm and corresponding histogram can be used to make trading decisions.
A short summary of the method is:
- Find three value swings of past data for the input assets.
- Fit a curve with a mathematical function to the value swings.
- Standardize the function so its outputs are easier to compare with outputs for other assets and/or other time frames.
- Use the k-nearest neighbor algorithm to find outputs of standardized functions for historical assets that are similar to the outputs of the standardized function for the current asset.
- Build a histogram with returns for simulated trades corresponding to the similar past outputs.
- Trade the current asset based on the shape of the histogram.
Here are the details:
Defining Input Value Swings
For each date in a symbol's data, the software finds data for three value swings with this algorithm:
- The swing direction starts as undefined.
- The lookback period is the past N bars, or back to the bar just past the previous value swing, whichever is longer.
- If no close value in the lookback period is higher than the current close value, the swing direction is up.
- If no close value in the lookback period is lower than the current close value, the swing direction is down.
- If the swing direction would be both up and down on the same bar (e.g., the value hasn't changed since the last swing), the swing direction does not change.
- If the swing direction changes from not up to up, a low swing has been detected at the oldest bar with the lowest value in the lookback period.
- If the swing direction changes from not down to down, a high swing has been detected at the oldest bar with the highest value in the lookback period.
- The first swing found is ignored.
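The steps above can be sketched in Python roughly as follows. This is a minimal sketch, not the original software: the function name find_swings and the parameter n (the N in the list) are hypothetical, the lookback window is assumed to include the current bar, "just past the previous value swing" is read as the bar after the swing bar, and the ignored first swing is assumed to still anchor later lookbacks.

```python
def find_swings(closes, n):
    """Return a list of (bar_index, kind) swings, kind being "low" or "high".

    Assumptions (not confirmed by the source): the lookback window includes
    the current bar, and it reaches back to the bar after the previous swing
    when that is further back than n bars.
    """
    swings = []
    direction = None      # swing direction starts as undefined
    last_swing = None     # bar index of the most recent swing
    for i in range(len(closes)):
        start = max(0, i - n)
        if last_swing is not None:
            start = min(start, last_swing + 1)  # whichever lookback is longer
        window = closes[start:i + 1]
        is_up = max(window) <= closes[i]    # no close higher than the current close
        is_down = min(window) >= closes[i]  # no close lower than the current close
        if is_up and is_down:
            continue                        # both at once: direction does not change
        if is_up and direction != "up":
            direction = "up"
            # list.index returns the first (oldest) bar with the lowest value
            last_swing = start + window.index(min(window))
            swings.append((last_swing, "low"))
        elif is_down and direction != "down":
            direction = "down"
            last_swing = start + window.index(max(window))
            swings.append((last_swing, "high"))
    return swings[1:]                       # the first swing found is ignored


closes = [5, 4, 3, 4, 5, 6, 5, 4, 3, 2, 3, 4, 5]
print(find_swings(closes, 2))
```

With these made-up closes and n = 2, the high at bar 0 is found first and dropped, leaving a low, a high, and a low.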
Modeling Input Value Swings
The software fits a curve to the data from the current bar back to the bar after the one at which the value swing preceding the three value swings was detected. The fitted curve is a least squares regression line plus zero or more skewed cosine waves.
A skewed cosine wave starts as an ordinary cosine wave with amplitude, period, and phase. The skew is the relative position of the trough of the wave between two peaks, expressed as a proportion of a period. For example, here is a cosine wave with amplitude 3, period 20, phase 2.3333 (0 <= phase <= two*pi), and skew 0.5 (no skew).
Here is a cosine wave with the same amplitude, period, and phase but with skew 0.8.
Here is a cosine wave with the same amplitude, period, and phase but with skew 0.2.
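One possible way to produce such a wave is sketched below. The post does not give the definition of skewed_cos, so the piecewise-linear phase warp here is an assumption: it moves the trough to the fraction skew of each period and reduces to an ordinary cosine at skew 0.5. The argument order (frequency, phase, skew, x) follows the formulas in this post.

```python
import math

TWOPI = 2.0 * math.pi


def skewed_cos(frequency, phase, skew, x):
    """Cosine whose trough sits at fraction `skew` of each period.

    Assumed implementation (not from the source): the position within the
    period is warped piecewise-linearly so that the peak stays at 0 and the
    trough moves from 0.5 to `skew`. Requires 0 < skew < 1.
    """
    u = ((frequency * x + phase) % TWOPI) / TWOPI   # position in period, 0 <= u < 1
    if u < skew:
        warped = 0.5 * u / skew                     # peak-to-trough half
    else:
        warped = 0.5 + 0.5 * (u - skew) / (1.0 - skew)  # trough-to-peak half
    return math.cos(TWOPI * warped)


# Amplitude 3, period 20, phase 2.3333, evaluated at x = 3 for three skews.
for skew in (0.5, 0.8, 0.2):
    print(skew, 3.0 * skewed_cos(TWOPI / 20.0, 2.3333, skew, 3.0))
```

With skew 0.8 the wave falls slowly and rises quickly; with skew 0.2 it does the opposite.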
An example of the formula for a fitted function is:
Code:
DIA_20240306_37_opy = 391.993896484375 + 5.76630401611328 * -0.0800857560803043 * x + 2.83467149734497 * (
0.290414988994598 * skewed_cos(twopi / 10.7259641207956, 2.21912574768066, 0.461418867111206, x)
+ 0.224823564291 * skewed_cos(twopi / 32.8402920009586, 2.60685586929321, 0.426971435546875, x)
+ 0.205139502882957 * skewed_cos(twopi / 4.69478300125808, 2.65043544769287, 0.473819971084595, x)
+ 0.167520850896835 * skewed_cos(twopi / 18.5536670608188, 4.66447830200195, 0.53011691570282, x)
+ 0.152913108468056 * skewed_cos(twopi / 5.76414163479297, 4.79805469512939, 0.562792778015137, x) ) ;
The arguments to the skewed_cos functions are frequency, phase (when x == 0), skew, and x value. This fitted curve is for the open prices (adjusted for dividends) for the 37 trading days ending 20240306 for DIA (SPDR Dow Jones Industrial Average ETF Trust). x == 0 is for 20240306; x < 0 is for trading days before 20240306; x > 0 is for (predicted) trading days after 20240306.
The Goertzel algorithm discovers the amplitudes, frequencies, and phases. Other software finds the skews for a better fit.
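For readers unfamiliar with it, here is a minimal sketch of the textbook Goertzel recurrence for a single frequency bin. It is not the original software: how the software scans candidate frequencies and removes the regression line first is not described in this post, and the function name goertzel and the bin index k are only illustrative.

```python
import math


def goertzel(samples, k):
    """Return (amplitude, phase) of DFT bin k, for 0 < k < len(samples) / 2.

    The matching cosine is amplitude * cos(2*pi*k*t/len(samples) + phase),
    so the period is len(samples) / k bars.
    """
    n = len(samples)
    w = 2.0 * math.pi * k / n
    coeff = 2.0 * math.cos(w)
    s1 = 0.0   # state one step back
    s2 = 0.0   # state two steps back
    for sample in samples:
        s0 = sample + coeff * s1 - s2
        s2 = s1
        s1 = s0
    # DFT bin value: X[k] = exp(j*w) * s1 - s2
    real = s1 * math.cos(w) - s2
    imag = s1 * math.sin(w)
    amplitude = 2.0 * math.hypot(real, imag) / n
    phase = math.atan2(imag, real) % (2.0 * math.pi)
    return amplitude, phase


# Synthetic check: amplitude 3, period 10 bars (k = 4 of 40 samples), phase 1.0.
samples = [3.0 * math.cos(2.0 * math.pi * 4 * t / 40 + 1.0) for t in range(40)]
print(goertzel(samples, 4))
```

The appeal of the recurrence is that each candidate frequency costs one pass over the data with two state variables, so a handful of periods can be tested without a full FFT.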
Standardizing Fitted Functions
To make curve values comparable with curve values for assets with different price ranges and/or time frames, the software standardizes them by removing:
- the y intercept from the least squares regression line; e.g., 391.993896484375
- the estimated standard deviation of the price data from the least squares regression line; e.g., 5.76630401611328
- the estimated standard deviation of the differences of price values from the least squares regression line; e.g., 2.83467149734497
An example of a standardized fitted curve formula is:
Code:
DIA_20240306_37_opy = -0.0800857560803043 * x + (
0.290414988994598 * skewed_cos(twopi / 10.7259641207956, 2.21912574768066, 0.461418867111206, x)
+ 0.224823564291 * skewed_cos(twopi / 32.8402920009586, 2.60685586929321, 0.426971435546875, x)
+ 0.205139502882957 * skewed_cos(twopi / 4.69478300125808, 2.65043544769287, 0.473819971084595, x)
+ 0.167520850896835 * skewed_cos(twopi / 18.5536670608188, 4.66447830200195, 0.53011691570282, x)
+ 0.152913108468056 * skewed_cos(twopi / 5.76414163479297, 4.79805469512939, 0.562792778015137, x) ) ;
The outputs of two fitted curves are more comparable when standardized. For example, here are standardized curves for open prices of DIA (SPDR Dow Jones Industrial Average ETF Trust) at 20240305 for 37 trading days and EWW (iShares MSCI Mexico ETF) at 20100423 for 59 trading days, where x == 0 is the current trading day for each curve (20240305 and 20100423).
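The relationship between the two formulas above can be sketched as follows, using the DIA coefficients from this post. The names FittedCurve, standardize, evaluate, trend_scale, and residual_scale are hypothetical, and plain_cos is only a stand-in that ignores the skew argument, since the definition of skewed_cos is not given here.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FittedCurve:
    intercept: float       # y intercept of the regression line
    trend_scale: float     # factor on the slope term (5.7663... in the example)
    residual_scale: float  # factor on the wave sum (2.8346... in the example)
    slope: float
    waves: tuple           # (amplitude, period, phase, skew) for each wave


def standardize(curve):
    """Drop the intercept and both scale factors; keep the slope and waves."""
    return replace(curve, intercept=0.0, trend_scale=1.0, residual_scale=1.0)


def evaluate(curve, x, wave_fn):
    wave_sum = sum(amp * wave_fn(2.0 * math.pi / period, phase, skew, x)
                   for amp, period, phase, skew in curve.waves)
    return (curve.intercept
            + curve.trend_scale * curve.slope * x
            + curve.residual_scale * wave_sum)


def plain_cos(frequency, phase, skew, x):
    """Stand-in for skewed_cos; the skew argument is ignored (assumption)."""
    return math.cos(frequency * x + phase)


dia = FittedCurve(
    intercept=391.993896484375,
    trend_scale=5.76630401611328,
    residual_scale=2.83467149734497,
    slope=-0.0800857560803043,
    waves=(
        (0.290414988994598, 10.7259641207956, 2.21912574768066, 0.461418867111206),
        (0.224823564291, 32.8402920009586, 2.60685586929321, 0.426971435546875),
        (0.205139502882957, 4.69478300125808, 2.65043544769287, 0.473819971084595),
        (0.167520850896835, 18.5536670608188, 4.66447830200195, 0.53011691570282),
        (0.152913108468056, 5.76414163479297, 4.79805469512939, 0.562792778015137),
    ),
)
dia_std = standardize(dia)
print(evaluate(dia, 0, plain_cos), evaluate(dia_std, 0, plain_cos))
```

Because the slope term and the wave term are scaled by different factors, the standardized value is not a simple rescaling of the price-level value; the two terms are normalized separately.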
Running K-Nearest Neighbor Algorithm
In this example, each input has the standardized fitted curves for ETF daily open prices (adjusted for splits and dividends) starting 19930325 and ending 20240221, simulating long trades that enter at the next trading day's open and exit at the following trading day's open. The software calculates the standardized fitted curve values from eight trading days before the current trading day through eight trading days after it (17 trading days total). The attached, tab-separated knn_etfs.csv lists the 68 ETFs used as inputs.
The inputs for testing are split 70% for training (294,996 records) and 30% for evaluation (126,427 records).
For each evaluation record, the software uses dynamic time warping to measure the distance to each training record and records a simulated trade result as log(next_next_trading_day_open / next_trading_day_open) * 100 for the closest 294 training records (a little less than 0.1% of the training records).
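A minimal sketch of this step is below. It is not the original software: it uses the classic unconstrained dynamic time warping recurrence with absolute difference as the local cost (the post does not say which variant or warping window is used), and the record layout (curve_values, next_open, next_next_open) and function names are hypothetical.

```python
import heapq
import math


def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def nearest_trade_results(query, training, k):
    """Simulated results for the k training records closest to `query`.

    Each training record is (curve_values, next_open, next_next_open).
    """
    nearest = heapq.nsmallest(k, training,
                              key=lambda rec: dtw_distance(query, rec[0]))
    return [math.log(next_next_open / next_open) * 100.0
            for _, next_open, next_next_open in nearest]


# Tiny made-up example; the post uses 17-value curves and k = 294.
training = [
    ([0.0, 0.0, 0.0], 100.0, 101.0),
    ([5.0, 5.0, 5.0], 100.0, 99.0),
    ([0.0, 0.0, 1.0], 100.0, 102.0),
]
print(nearest_trade_results([0.0, 0.0, 0.0], training, 2))
```

A brute-force scan like this compares every evaluation record with every training record, which is feasible here only because each curve has just 17 values.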
Histogram Construction and Interpretation
The software builds a histogram of the simulated results, using the Freedman–Diaconis rule to calculate the bin width. For example, the histogram for DIA (SPDR Dow Jones Industrial Average ETF Trust) at 20240305 is:
The middle marked bin is the mode. The surrounding marked bins are the first bins around the mode bin that, together with the bins between them, cover at least 68.26% of the 294 values (201 values, rounded to the nearest whole number). The 68.26% is the proportion of values within one standard deviation of the mean of a normal distribution.
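The construction can be sketched as follows, using only the Python standard library. The Freedman–Diaconis width is 2 * IQR / n^(1/3). How the marked bins are grown is not spelled out in this post, so the sketch assumes they expand one bin per side at a time from the mode and stop at the histogram edges; the function names and the quartile method (the default of statistics.quantiles) are also assumptions.

```python
import math
import statistics


def freedman_diaconis_width(values):
    q1, _, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    return 2.0 * (q3 - q1) / len(values) ** (1.0 / 3.0)


def build_histogram(values):
    """Return (lowest_edge, bin_width, counts)."""
    lowest = min(values)
    width = freedman_diaconis_width(values)
    if width <= 0.0:                 # zero IQR: fall back to a single bin
        return lowest, 0.0, [len(values)]
    n_bins = max(1, math.ceil((max(values) - lowest) / width))
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lowest) / width), n_bins - 1)  # max value goes in last bin
        counts[idx] += 1
    return lowest, width, counts


def mark_bins(counts, coverage=0.6826):
    """Return (lo_bin, mode_bin, hi_bin) covering at least the target count.

    Assumption: the marks move outward one bin per side per step.
    """
    target = round(coverage * sum(counts))   # e.g. round(0.6826 * 294) == 201
    mode = counts.index(max(counts))         # first bin with the highest count
    lo = hi = mode
    covered = counts[mode]
    while covered < target:
        if lo > 0:
            lo -= 1
            covered += counts[lo]
        if hi < len(counts) - 1:
            hi += 1
            covered += counts[hi]
    return lo, mode, hi


print(mark_bins([1, 2, 10, 3, 1]))
```

Under this assumption the marks are equally far from the mode unless one of them reaches the edge of the histogram first; a different expansion scheme, such as always adding the taller neighboring bin, would give different marks.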
A rule to determine whether to enter a long trade at the next trading day's open (and exit at the following day's open) is:
Code:
mode > 0
and
(hi_mark - mode) >= (mode - lo_mark)
and
number_of_results_between_mode_and_hi_mark_inclusive <= number_of_results_between_lo_mark_and_mode_inclusive
My theory for this rule is that a new result between the mode and the high mark (inclusive) is more likely to happen because it either makes the histogram more symmetrical (closer to a normal distribution) or keeps it symmetrical.
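As a sketch, the rule can be applied to a histogram like this. The function name and arguments are hypothetical, and it assumes that mode, lo_mark, and hi_mark in the rule are positions of equal-width bins, with the mode's value taken at the center of the mode bin.

```python
def should_enter_long(lowest_edge, bin_width, counts, lo_bin, mode_bin, hi_bin):
    """Apply the three-part rule to a histogram with equal-width bins.

    Assumption: the mode's trade result is the center of the mode bin.
    """
    mode_value = lowest_edge + (mode_bin + 0.5) * bin_width
    # With equal-width bins, comparing bin offsets is the same as comparing
    # (hi_mark - mode) with (mode - lo_mark), without float rounding.
    upper_span = hi_bin - mode_bin
    lower_span = mode_bin - lo_bin
    upper_count = sum(counts[mode_bin:hi_bin + 1])   # mode to hi mark, inclusive
    lower_count = sum(counts[lo_bin:mode_bin + 1])   # lo mark to mode, inclusive
    return (mode_value > 0
            and upper_span >= lower_span
            and upper_count <= lower_count)


# Made-up histogram: bins of width 0.5 starting at -1.0.
print(should_enter_long(-1.0, 0.5, [1, 5, 10, 4, 2], 1, 2, 3))
```

In this made-up example the mode bin is centered at 0.25, the marks are one bin away on each side, and the upper side holds 14 results against 15 on the lower side, so the rule signals a long entry.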
Simulated Results
For the evaluation data using the above rule, the mean trade result is a gain of 0.1679%, mean win 0.9408%, mean loss 0.8544% and win rate 57.17%.
For the evaluation data using buy and hold, the mean one-day simulated trade result is a gain of 0.0396%, mean win 0.8824%, mean loss 0.9362%, and win rate 53.88%.
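The statistics quoted above can be computed from a list of simulated trade results along these lines. This is only a sketch: the function name summarize is hypothetical, and it assumes that a result of exactly zero counts as neither a win nor a loss, which the post does not state.

```python
def summarize(results):
    """Mean result, mean win, mean loss (as a positive number), win rate in %."""
    wins = [r for r in results if r > 0]
    losses = [r for r in results if r < 0]
    return {
        "mean": sum(results) / len(results),
        "mean_win": sum(wins) / len(wins) if wins else 0.0,
        "mean_loss": -sum(losses) / len(losses) if losses else 0.0,
        "win_rate": 100.0 * len(wins) / len(results),
    }


# Made-up results in percent, i.e. log(exit / entry) * 100.
print(summarize([1.0, -0.5, 2.0, 0.0]))
```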
On a per-trade basis, the method beat buy and hold on a large evaluation input (126,427 records): a mean result of 0.1679% versus 0.0396%, and a win rate of 57.17% versus 53.88%.