On Mediating Policies
Faced with a difficult task, an agent may have access to several suggestions about how to behave, any of which could help it perform well quickly. These might come as advice from an external source or from knowledge gained while solving similar problems. The agent's goal is then to quickly identify the best source of advice while avoiding bad advice as much as possible.
We introduce a setting in which a mediator agent must choose amongst a set of "experts" advising it on actions to take on an unknown and unobserved Markov Decision Process (MDP). We provide an algorithm which, when the experts are stationary policies and the MDP is unichain, will achieve a return that competes favorably with that of each expert in a number of steps polynomial in its mixing time and other natural parameters. We also present empirical results that illustrate the strengths and weaknesses of our algorithm in practice and demonstrate its applicability in two transfer learning domains.
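To make the setting concrete, here is a minimal Python sketch of a mediator that never observes the MDP state and sees only the rewards earned while following each expert. It is not the algorithm described above: it uses a plain explore-then-commit heuristic. The toy two-state MDP, the two experts, and the names `make_toy_mdp`, `follow_expert`, and `mediate` are all assumptions made for illustration. The block length stands in, loosely, for the idea that an expert must be followed long enough (relative to its mixing time) for its average reward to be informative.

```python
import random


def make_toy_mdp(seed=0):
    """Hypothetical two-state MDP. Returns (step, observe) closures.

    Reward is 1 only when action 1 is taken in state 1.
    Action 1 moves to state 1 with probability 0.9, action 0 with 0.1.
    """
    rng = random.Random(seed)
    state = 0

    def step(action):
        nonlocal state
        reward = 1.0 if (state == 1 and action == 1) else 0.0
        p_to_1 = 0.9 if action == 1 else 0.1
        state = 1 if rng.random() < p_to_1 else 0
        return reward

    def observe():
        # Only the experts call this; the mediator never sees the state.
        return state

    return step, observe


def follow_expert(step, observe, expert, n_steps):
    """Follow one stationary policy for n_steps; return its average reward."""
    total = 0.0
    for _ in range(n_steps):
        total += step(expert(observe()))
    return total / n_steps


def mediate(step, observe, experts, block_len, total_steps):
    """Explore-then-commit: try each expert for one block, then commit.

    This is a simple stand-in heuristic, not the method from the abstract.
    """
    averages = [follow_expert(step, observe, e, block_len) for e in experts]
    best = max(range(len(experts)), key=lambda i: averages[i])
    remaining = total_steps - block_len * len(experts)
    exploit_avg = follow_expert(step, observe, experts[best], remaining)
    return best, averages, exploit_avg


# Two hypothetical experts, each a stationary policy (state -> action).
def always_one(state):
    return 1


def always_zero(state):
    return 0


step, observe = make_toy_mdp(seed=0)
best, averages, exploit_avg = mediate(
    step, observe, [always_one, always_zero], block_len=200, total_steps=1000
)
print(best, averages, exploit_avg)
```

In this toy problem the second expert never takes the rewarded action, so its average reward is exactly zero and the mediator commits to the first expert. A fixed block length is the weak point of the heuristic: if it is too short, a good expert can look bad simply because the process has not yet settled.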