Patent US7707131 - Thompson strategy based online reinforcement learning system for action selection
http://www.google.de/patents/US7707131?utm_source=gb-gplus-share

A system and method for online reinforcement learning is provided. In particular, a method for performing the explore-vs.-exploit tradeoff is provided. Although the method is heuristic, it can be applied in a principled manner while simultaneously learning the parameters and/or structure of the model...
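
To make the explore-vs.-exploit idea concrete, here is a minimal sketch of a Thompson-style action selector for a simple Bernoulli bandit. It is a generic textbook illustration and not the method claimed in the patent. The function names (`thompson_select`, `run_bandit`), the uniform Beta(1, 1) prior, and the reward probabilities are all assumptions made for this example.

```python
import random


def thompson_select(successes, failures, rng):
    """Sample one value from each action's Beta posterior and return the
    index of the action with the largest sample."""
    # Beta(s + 1, f + 1) is the posterior under an assumed uniform Beta(1, 1) prior.
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)


def run_bandit(true_probs, steps, seed=0):
    """Simulate a Bernoulli bandit; return per-action success and failure counts."""
    rng = random.Random(seed)
    successes = [0] * len(true_probs)
    failures = [0] * len(true_probs)
    for _ in range(steps):
        action = thompson_select(successes, failures, rng)
        # Hypothetical environment: reward is 1 with probability true_probs[action].
        if rng.random() < true_probs[action]:
            successes[action] += 1
        else:
            failures[action] += 1
    return successes, failures


if __name__ == "__main__":
    wins, losses = run_bandit([0.1, 0.9], steps=2000)
    pulls = [w + l for w, l in zip(wins, losses)]
    print("pulls per action:", pulls)
```

Because actions are chosen by sampling from the posterior rather than by taking its mean, an action with few observations still gets tried occasionally (exploration), while an action with strong evidence of high reward is chosen most of the time (exploitation). The model parameters, here the per-action success rates, are learned online as the counts are updated.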