Poster in Workshop: Mathematical and Empirical Understanding of Foundation Models (ME-FoMo)
Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons
Banghua Zhu · Jiantao Jiao · Michael Jordan
Keywords: [ Pessimism ] [ reinforcement learning with human feedback (RLHF) ] [ Plackett-Luce model ] [ Bradley-Terry-Luce model ] [ maximum likelihood estimator ] [ offline reinforcement learning ]