A couple things:
1. I'm no authority in the field, so feel free to disregard anything I say, as it is only based on my experience.
2. I don't see any reason why you couldn't do something like you've described, but it almost seems more like supervised learning than reinforcement learning, since you are guiding the algorithm to a particular outcome. I know I said there isn't a huge difference between the two, but there are some.
3. If you really are doing supervised learning, the algorithm is supposed to learn from its successes and mistakes. If you are allowing online learning on the real market, your algorithm should eventually figure out which conditions have positive and negative expectancy without your guidance, hopefully without a large drawdown. If you are backtesting, your fill model should account for this latency competition.
I like where your head's at, but if I'm understanding you correctly, I think you might get better results with a higher fidelity fill model than with learning constraints.
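
To make the fill model idea a bit more concrete, here is a rough Python sketch of what I mean. It is only an illustration under my own assumptions, not a real exchange simulator: the names `LimitOrder`, `simulate_fill`, `latency`, and `queue_ahead` are all made up for this example, and the rules (you only become live after the latency has passed, and you only fill at your price once the size queued ahead of you has traded) are a simplification of how a passive order actually gets filled.

```python
from dataclasses import dataclass


@dataclass
class LimitOrder:
    side: str       # "buy" or "sell"
    price: float    # limit price
    size: int       # quantity we want filled
    sent_at: float  # time the strategy decided to send, in seconds


def simulate_fill(order, trades, latency, queue_ahead):
    """Estimate how much of a passive limit order would have filled.

    trades      -- list of (timestamp, price, size), sorted by timestamp
    latency     -- assumed seconds between the decision and the order
                   actually resting on the book
    queue_ahead -- assumed size already resting ahead of us at our price
                   when our order arrives
    """
    live_at = order.sent_at + latency
    remaining_queue = queue_ahead
    filled = 0

    for ts, price, size in trades:
        if ts < live_at:
            continue  # faster participants got these; we were not on the book

        if order.side == "buy":
            traded_through = price < order.price
        else:
            traded_through = price > order.price

        if traded_through:
            # The market traded through our level, so assume a full fill.
            return order.size

        if price != order.price:
            continue  # trade happened away from our level

        # Trade at our price: the queue ahead of us is served first.
        consumed = min(remaining_queue, size)
        remaining_queue -= consumed
        available = size - consumed

        filled += min(available, order.size - filled)
        if filled == order.size:
            break

    return filled


trades = [(0.001, 100.0, 50), (0.010, 100.0, 80), (0.020, 100.0, 40)]
order = LimitOrder(side="buy", price=100.0, size=30, sent_at=0.0)

optimistic = simulate_fill(order, trades, latency=0.0, queue_ahead=0)
realistic = simulate_fill(order, trades, latency=0.005, queue_ahead=100)
print(optimistic, realistic)
```

With zero latency and no queue the order fills completely on the first trade, while with 5 ms of latency and 100 shares ahead of it the same order only gets a partial fill. That gap is the kind of thing I would want the backtest to capture before adding constraints to the learner.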