Reinforcement Learning

cjbuckley4 · Jul 14, 2015

Anyone here do any of it?

IAS_LLC · Jul 14, 2015

My master's thesis used reinforcement learning to control an otherwise uncontrollable aircraft that had sustained some sort of in-flight damage (stuck aileron, blown-off wing, etc.). I may be alone in this... but I don't make a huge distinction between Supervised and Unsupervised learning. By providing a cost function to be minimized (or maximized), you are really saying you want a particular Input/Output mapping characteristic, just like in Supervised learning.

As for trading... I'm toying with it right now. Updating parameters in real time to maximize the expected value with an acceptable distribution.

cjbuckley4 · Jul 14, 2015

@IAS_LLC I think that it could be very useful to weight the reinforcement by time in some way. For example, suppose we're dealing with some microstructure model that gives the probability of an uptick or downtick conditional on x,y,z, as well as our median latency. Suppose you're taking liquidity and you're concerned about someone getting there first...you could use...for instance (spitballing)...

{no trade if t <= your median latency,
{profit*(Gamma(t; k, theta)) where t = trade time, t > median latency, k > 1, with k and theta fit out of sample.

The reason I suggest gamma 1 > k is because it will weight trades closer to the time you experience competition lower, higher during the "sweet spot" and lower as t -> inf, because this is microstructure and the signal could become less valid as time continues. Also, most reinforcement algorithms tend to weight slower accurate predictions lower from what I've seen.
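A minimal sketch of the weighting described above, using only the standard library. The function names (`gamma_pdf`, `weighted_reward`) and the convention of returning `None` for "no trade" are assumptions for illustration, and `k` and `theta` are taken as given rather than fitted. For k > 1 the Gamma density is zero at t = 0, peaks at t = (k - 1) * theta, and decays as t grows, which is the "sweet spot" shape.

```python
import math

def gamma_pdf(t, k, theta):
    """Gamma density with shape k and scale theta; zero for t <= 0."""
    if t <= 0:
        return 0.0
    return t ** (k - 1) * math.exp(-t / theta) / (math.gamma(k) * theta ** k)

def weighted_reward(profit, t, median_latency, k, theta):
    """Return None (no trade) when t <= median_latency,
    otherwise the profit scaled by the Gamma density at t."""
    if t <= median_latency:
        return None
    return profit * gamma_pdf(t, k, theta)

# Hypothetical numbers: with k=2 and theta=1 the weight peaks at t = 1.
near = weighted_reward(1.0, 0.6, 0.5, 2.0, 1.0)   # just past the latency cutoff
peak = weighted_reward(1.0, 1.0, 0.5, 2.0, 1.0)   # at the mode of the density
late = weighted_reward(1.0, 5.0, 0.5, 2.0, 1.0)   # far out in the tail
```

The same profit earns the largest reward near the mode and a smaller one on either side of it.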

Feel free to PM me or post if you have any ideas/papers/thoughts.

cjbuckley4 · Jul 14, 2015

Edit^ k > 1 in the last paragraph.

cjbuckley4 · Jul 15, 2015

http://webdocs.cs.ualberta.ca/~sutton/book/ebook/the-book.html

IAS_LLC · Jul 15, 2015

I'm not sure I follow. Are you essentially wanting to "forget" what you learned a long time ago because you suspect it is not as relevant as the more recent input data? If yes... In my experience, this happens somewhat automatically if you are using online training. Batch training is a different story... although I don't think it would be too hard to manipulate the data presentation or the training procedure itself to favor more recent data. Perhaps augment the cost function with the age of the data (Cost_aug = Cost/dataAge)?
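A rough sketch of both ideas, with hypothetical helper names. `age_weighted_cost` applies Cost_aug = Cost/dataAge per sample and sums the result (the summation is an assumption, since the post does not say how samples are combined). `online_estimate` is a constant-step-size running estimate, one common form of online update, in which each new sample shrinks the influence of older ones by a factor of (1 - step).

```python
def age_weighted_cost(costs, ages):
    """Sum of cost / age over samples; ages must be positive."""
    return sum(c / a for c, a in zip(costs, ages))

def online_estimate(values, step=0.1, start=0.0):
    """Constant-step-size running estimate.
    Older values are down-weighted geometrically by (1 - step) per update."""
    est = start
    for v in values:
        est += step * (v - est)
    return est

# Two samples with equal cost: the older one (age 4) counts a quarter as much.
batch_cost = age_weighted_cost([2.0, 2.0], [1, 4])

# The data switches from 0 to 1 halfway through; the online estimate
# ends up close to the recent regime rather than the overall mean of 0.5.
recent_biased = online_estimate([0.0] * 50 + [1.0] * 50, step=0.1)
```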

cjbuckley4 · Jul 15, 2015

No, that's not really what I was suggesting. My suggestion was just an example where there was some interval of time between our trade signal (call that t) and the time the trading opportunity came to fruition (call that X), t < X. Now, suppose we believe that trades where the interval X - t is too close to zero may be too competitive for us: maybe we want to weight the super high frequency trading opportunities lower so we don't end up fitting a model that only competes with the pros. Furthermore, as X - t -> infinity, maybe the probability gets higher that our signal is only correct because of random noise and has nothing to do with the validity of our model. My suggestion is that we map X - t to values on a function, such as the Gamma density, and weight the outcome profit by the value we mapped to.

The goal is to reinforce the learning algorithm when it finds opportunities that meet our criteria for competitiveness, and hopefully to remove some trading signals that are just generated by noise. Just an idea, I'm still just reading the introductory material on reinforcement learning.
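One way the weighted profit could feed a learner is as the reward in an incremental value update of the kind covered in the introductory chapters of the Sutton and Barto book linked above. This is only a sketch under assumptions: `update_value`, the step size, and the (profit, weight) pairs are all hypothetical, and the weights stand in for whatever the interval-to-weight mapping would return.

```python
def update_value(value, profit, weight, step=0.05):
    """Move a running value estimate toward the weighted profit of one trade."""
    shaped = weight * profit
    return value + step * (shaped - value)

# Hypothetical (profit, weight) pairs for a single signal.
trades = [(1.0, 0.05), (1.0, 0.35), (-0.5, 0.30), (2.0, 0.02)]

value = 0.0
for profit, weight in trades:
    value = update_value(value, profit, weight)
```

A large profit with a tiny weight (the last pair) moves the estimate less than a modest profit in the favored interval.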

IAS_LLC · Jul 15, 2015

A couple things:

1. I'm no authority in the field, so feel free to disregard anything I say, as it is only based on my experience.

2. I don't see any reason why you couldn't do something like you've described, but it almost seems more like supervised learning than reinforcement learning, since you are guiding the algorithm to a particular outcome... I know I said there isn't a huge difference between the two, but there are some differences.

3. If you really are doing supervised learning, the algorithm is supposed to learn from its successes and mistakes. If you are allowing online learning on the real market, your algorithm should eventually figure out the conditions that have positive and negative expectancy without your guidance, hopefully without a large drawdown. If you are backtesting, your fill model should account for this latency competition.

I like where your head's at, but if I'm understanding you correctly, I think you might get better results with a higher-fidelity fill model than with learning constraints.

IAS_LLC · Jul 16, 2015

In bullet 3, I meant reinforcement learning, not supervised.

cjbuckley4 · Jul 16, 2015

1. That's fine. I'm no expert either. I've taken two 'graduate level' machine learning courses at school and that is the extent of my knowledge. No reinforcement learning was discussed.

2. You're correct that we're guiding the algorithm to a given outcome, but won't that be the case in any situation where we argmax/min the cost function? I suppose maximizing or minimizing a function is different from learning from successive results. I'm not sure that what I'm suggesting is necessarily supervised learning because, in my experience, supervised learning provides the learning algorithm with a dataset of input and output pairs known to be correct. This would not be the case in my example. You're correct that what I've suggested doesn't actually learn from successive outcomes, so in that sense it is indeed not reinforcement.

3. I agree that modeling whether you get a fill is the responsibility of your backtesting application. That being said, I think that in practice, steering my algorithm toward less latency-sensitive opportunities may still be useful.

I'm just feeling the concept out at this point, seeing how useful it might be. Do you know of any references that cover trading or time series learning?