#### Faculty Candidate Seminar

# Simulating Markov Decision Processes

Add to Google Calendar

Many complex systems can be modeled as Markov decision processes and games. Despite impressive advances in computational technology and methods, often, our only recourse to their analysis and design is through computer simulations. In many problems, even the system parameters may be unknown. In such cases, we would like to obtain uniform estimates of quantities of interest such as the value function.

The value function of a Markov decision process assigns to each policy its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We consider a non-parametric framework and derive bounds on the number of runs needed for the uniform convergence of the empirical average to the expected reward uniformly for a class of policies, in terms of the P-dimension of the policy class. Surprisingly, how we simulate an MDP seems to matter. There are good simulation models and bad ones. Uniform convergence results are also obtained for the average reward case, for some partially observed processes, and for Markov games. The results can be viewed as a contribution to empirical process theory and as an extension of the probably approximately correct (PAC) learning theory for partially observable MDPs and Markov games.

In the last part of the seminar, I will talk about a different problem: Efficient network resource allocation through auctions. I will introduce the combinatorial Seller's Bid Double Auction (c-SeBiDA) for bandwidth and spectrum allocation, and show that, despite strategic behavior of player's, the auction outcome is efficient, i.e., the price of anarchy is zero.