Your use of dynamic programming has piqued my interest; I use it quite a bit in my day job (aerospace control systems). Would you mind sharing some of the details?
The dynamic programming approach that I am currently following is quite similar to Q-learning. The action chosen by the policy consists of the prices and quantities of a set of limit orders.
The state in my execution algorithm is a vector of attributes that describe the current configuration of my system. Two of these attributes are (i) the time elapsed since the trading signal was generated by the decision engine, whose maximum value is the execution time horizon (5 minutes in my case), and (ii) the number of shares of the target quantity that are left to execute.
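To make the state and update concrete, here is a minimal tabular sketch of a Q-learning step on the two attributes named above. The bucket sizes, the action menu of limit-order price offsets, the reward and the learning parameters are all assumptions made for illustration; a real state vector would also carry the market variables and would probably need function approximation rather than a table.

```python
import numpy as np

# All sizes below are assumed discretisations, not taken from the post.
N_TIME = 6      # elapsed-time buckets: minute 0..5 of the execution horizon
N_INV = 11      # remaining-quantity buckets: 0%..100% of target in 10% steps
N_ACTIONS = 4   # hypothetical menu of limit-order price offsets

def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=1.0):
    """One tabular Q-learning step; state = (time_bucket, inventory_bucket)."""
    t, inv = state
    t_next, inv_next = next_state
    # The episode ends when the horizon is reached or nothing is left to trade.
    terminal = (t_next == N_TIME - 1) or (inv_next == 0)
    if terminal:
        target = reward
    else:
        target = reward + gamma * Q[t_next, inv_next].max()
    Q[t, inv, action] += alpha * (target - Q[t, inv, action])
    return Q

Q = np.zeros((N_TIME, N_INV, N_ACTIONS))
# Example transition: one minute passes and 30% of the target gets filled.
Q = q_update(Q, state=(0, 10), action=2, reward=-0.5, next_state=(1, 7))
```

The reward here is a placeholder; in an execution setting it would typically be some measure of the price achieved on the shares filled during the step.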
Other important attributes are a set of market variables that are functions of the price and volume of the traded instrument. Among them are predictions of the price 1, 2, 3, 4 and 5 minutes into the future. For these I apply adaptive filter techniques in combination with discrete wavelet filtering (to reduce noise). To reduce the computational load, I make it an absolute top priority to cast all my algorithms in recursive / online forms (i.e. I avoid rolling-window batch learning approaches).
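As an illustration of what a recursive / online form looks like, the sketch below uses recursive least squares (RLS) with a forgetting factor, one standard adaptive filter. The post does not say which adaptive filter is actually used, so RLS, the feature vector, and the parameter values are assumptions. Each new sample costs O(n^2) work and no window of past data is stored.

```python
import numpy as np

class RecursiveLeastSquares:
    """Online linear predictor y ~ w.x, updated one sample at a time."""

    def __init__(self, n_features, forgetting=0.99, delta=1e3):
        self.w = np.zeros(n_features)        # current weight estimate
        self.P = delta * np.eye(n_features)  # inverse-correlation estimate
        self.lam = forgetting                # < 1 discounts old samples

    def predict(self, x):
        return float(self.w @ x)

    def update(self, x, y):
        Px = self.P @ x
        gain = Px / (self.lam + x @ Px)
        error = y - self.w @ x               # a-priori prediction error
        self.w = self.w + gain * error
        self.P = (self.P - np.outer(gain, Px)) / self.lam
        return error

# Synthetic check: recover known weights from noise-free samples.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
rls = RecursiveLeastSquares(n_features=2)
for _ in range(200):
    x = rng.normal(size=2)
    rls.update(x, true_w @ x)
```

For multi-horizon prediction, one such filter per horizon (1 to 5 minutes) would be one option, with the wavelet-denoised series supplying the inputs.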
Other than Q-learning-type dynamic programming approaches, I am also looking into methods that are based on direct adaptive control techniques. One idea that I have been working on is to find differentiable analytic expressions for the average execution price at the end of the 5-minute horizon. For this I seek approximate expressions that are smooth, i.e. have no discontinuities. Having a policy function approximation at my disposal, I can then use gradient descent techniques to adaptively tune its parameters by recursion. A problem is that the analytic expressions, which I derive with the help of the symbolic computation package Mathematica, become very unwieldy and complex, so their computational load during live trading can become a bottleneck.
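The smoothing idea can be shown on a deliberately tiny toy model, which is not the expression derived in Mathematica. Assume a single buy limit order placed an offset theta below the mid price: it either fills (cost -theta relative to mid) or does not, in which case a fixed penalty is paid to cross the spread at the end. Replacing the hard fill / no-fill step with a logistic fill probability makes the expected cost differentiable, so theta can be tuned by gradient descent. Every name and number below is hypothetical.

```python
import math

# Toy parameters (all assumed for illustration)
REACH = 0.05    # offset at which the fill probability is 50%
SCALE = 0.02    # smoothing width of the logistic fill curve
PENALTY = 0.03  # cost relative to mid if the order never fills

def fill_prob(theta):
    """Smooth stand-in for the step function 'fills iff theta < REACH'."""
    return 1.0 / (1.0 + math.exp(-(REACH - theta) / SCALE))

def expected_cost(theta):
    p = fill_prob(theta)
    return p * (-theta) + (1.0 - p) * PENALTY

def cost_gradient(theta):
    p = fill_prob(theta)
    dp = -p * (1.0 - p) / SCALE          # d(fill_prob)/d(theta)
    return -p - (theta + PENALTY) * dp   # product rule on expected_cost

theta = 0.0
for _ in range(2000):
    theta -= 1e-3 * cost_gradient(theta)  # plain gradient descent step
```

In a live setting the loop would take one step per new observation, which keeps the tuning recursive. The cost of each step is one evaluation of the gradient expression, which is why unwieldy symbolic expressions become the bottleneck.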
