Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds
We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, though unknown, diameter. Our main result is a high-probability regret upper bound of $\tilde{O}(D\sqrt{SAT})$ for any communicating MDP with $S$ states, $A$ actions, and diameter $D$. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over time horizon $T$. This result closely matches the known lower bound of $\Omega(\sqrt{DSAT})$. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
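To make the posterior-sampling idea concrete, below is a minimal Python sketch of the vanilla loop the abstract builds on: maintain Dirichlet posteriors over the transition probabilities of each state-action pair, sample one MDP from the posterior, plan in it, and act. This is not the paper's exact optimistic algorithm; it assumes known rewards, substitutes discounted value iteration for average-reward planning, and the environment interface (`env.reset`, `env.step`) and helper names are hypothetical.

```python
# A minimal sketch of posterior (Thompson) sampling for RL with Dirichlet
# posteriors over transitions. Assumptions (not from the paper): known mean
# rewards, discounted value iteration as the planner, hypothetical `env` API.
import numpy as np

def sample_and_plan(counts, rewards, gamma=0.99, iters=500):
    """Sample one MDP from the Dirichlet posterior and plan in it.

    counts:  (S, A, S) array of transition counts plus a prior
             (e.g. all-ones for a uniform Dirichlet prior).
    rewards: (S, A) array of known mean rewards.
    Returns a greedy policy (length-S array of actions) for the sampled MDP.
    """
    S, A, _ = counts.shape
    # Posterior sample: each row P[s, a, :] ~ Dirichlet(counts[s, a, :]).
    P = np.array([[np.random.dirichlet(counts[s, a]) for a in range(A)]
                  for s in range(S)])
    # Value iteration on the sampled MDP.
    V = np.zeros(S)
    for _ in range(iters):
        Q = rewards + gamma * P @ V          # (S, A) action values
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

def run_episode(env, counts, rewards, H=100):
    """Resample a model, act for H steps, and update posterior counts."""
    policy = sample_and_plan(counts, rewards)
    s = env.reset()
    for _ in range(H):
        a = policy[s]
        s_next = env.step(a)                 # hypothetical: returns next state
        counts[s, a, s_next] += 1            # Dirichlet posterior update
        s = s_next
    return counts
```

The paper's algorithm refines this basic loop to obtain its worst-case guarantee; the sketch only illustrates the sample-then-plan structure that posterior sampling shares.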
Barcode | Collection Type | Call Number | Location | Status
---|---|---|---|---
art144795 | null | Article | Gdg9-Lt3 | Available but not for loan - No Loan
No other versions available.