Contextual Bandits with Cross-Learning
In the classic contextual bandits problem, in each round $t$ a learner observes some context $c$, chooses some action $i$ to perform, and receives some reward $r_{i,t}(c)$. We consider the variant of this problem in which, in addition to receiving the reward $r_{i,t}(c)$, the learner also learns the values of $r_{i,t}(c')$ for some other contexts $c'$ in a set $O_i(c)$, that is, the rewards that would have been achieved by performing that action under the different contexts $c' \in O_i(c)$. This variant arises in several strategic settings, such as learning how to bid in non-truthful repeated auctions, a problem that has gained a lot of attention lately as many platforms have switched to running first-price auctions. We call this problem the contextual bandits problem with cross-learning. The best algorithms for the classic contextual bandits problem achieve $\tilde{O}(\sqrt{CKT})$ regret against all stationary policies, where $C$ is the number of contexts, $K$ the number of actions, and $T$ the number of rounds.
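To make the cross-learning feedback concrete, consider the first-price auction setting mentioned above, where the context is the bidder's private value and the action is the bid. The following minimal Python sketch (the discretization and function name are illustrative assumptions, not from the paper) shows why a single auction outcome reveals the reward of the chosen bid under every value context at once.

```python
# Minimal sketch of the cross-learning feedback in a first-price auction
# (the discretization and names are illustrative assumptions, not from the
# paper). Context c = the bidder's private value v; action i = the bid b.
# Once the bidder sees whether bid b won, the utility (v' - b) * 1[won] is
# computable for EVERY value v', not only the realized one, so O_i(c) here
# is the entire context set: complete cross-learning.

def cross_learned_rewards(bid: float, won: bool, values: list[float]) -> dict[float, float]:
    """Counterfactual reward of playing `bid` under every value context."""
    return {v: (v - bid) * (1.0 if won else 0.0) for v in values}

# A winning bid of 0.3 reveals the reward for all five value contexts at once,
# e.g. a bidder with value 1.0 would have earned 0.7 from this same bid.
print(cross_learned_rewards(bid=0.3, won=True, values=[0.0, 0.25, 0.5, 0.75, 1.0]))
```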
We design and analyze new algorithms for the contextual bandits problem with cross-learning and show that their regret has better dependence on the number of contexts. Under complete cross-learning, in which the rewards for all contexts are learned when choosing an action, that is, the set $O_i(c)$ contains all contexts, we show that our algorithms achieve regret $\tilde{O}(\sqrt{KT})$, removing the dependence on $C$. In all other cases, that is, under partial cross-learning in which $|O_i(c)| < C$ for some context-action pair $(i, c)$, the achievable regret depends on the structure of the sets $O_i(c)$.
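A rough sketch may help show why cross-learning can remove the dependence on $C$. Below is a UCB1-style learner, an illustrative stand-in assuming stochastic rewards rather than a statement of the paper's algorithms, that folds every revealed counterfactual reward into its estimates: under complete cross-learning, each round updates all contexts for the chosen action, so no context has to be explored separately.

```python
# UCB1-style sketch of cross-learning (an illustrative stand-in under a
# stochastic-rewards assumption, not the paper's exact algorithm or analysis).
# Every counterfactual reward revealed in a round updates the estimate of the
# corresponding (action, context) pair, so under complete cross-learning a
# single pull of action i refreshes all C contexts for i at once.
import math
from collections import defaultdict

class CrossLearningUCB:
    def __init__(self, n_actions: int):
        self.actions = range(n_actions)
        self.n = defaultdict(int)       # sample counts per (action, context)
        self.mean = defaultdict(float)  # running mean reward per (action, context)
        self.t = 0

    def choose(self, context) -> int:
        self.t += 1
        def ucb(i: int) -> float:
            n = self.n[(i, context)]
            if n == 0:
                return float("inf")     # force one initial try per pair
            return self.mean[(i, context)] + math.sqrt(2 * math.log(self.t) / n)
        return max(self.actions, key=ucb)

    def update(self, action: int, observed: dict) -> None:
        # `observed` maps each context c' in O_action(c) to the revealed
        # reward r_{action,t}(c'); with cross-learning this can cover many
        # contexts beyond the one actually experienced this round.
        for ctx, r in observed.items():
            key = (action, ctx)
            self.n[key] += 1
            self.mean[key] += (r - self.mean[key]) / self.n[key]
```

Combined with the auction sketch above, a round would call `learner.choose(v)` to pick a bid and then `learner.update(i, cross_learned_rewards(...))` to feed all counterfactual rewards back in one step.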