Poster

Beyond Stationarity: Convergence Analysis of Stochastic Softmax Policy Gradient Methods

Sara Klein · Simon Weissmann · Leif Döring

Halle B
Fri 10 May 1:45 a.m. PDT — 3:45 a.m. PDT

Abstract: Markov Decision Processes (MDPs) are a formal framework for modeling and solving sequential decision-making problems. Finite-time-horizon problems arise, for instance, in optimal stopping and in specific supply chain problems, but also in the training of large language models. In contrast to infinite-horizon MDPs, optimal policies are not stationary; a policy must be learned for every single epoch. In practice, all parameters are often trained simultaneously, ignoring the inherent structure suggested by dynamic programming. This paper introduces a combination of dynamic programming and policy gradient, called dynamical policy gradient, in which the parameters are trained backwards in time. For the tabular softmax parametrization, we carry out a convergence analysis of simultaneous and step-wise training towards global optima, both in the exact and the sampled gradient settings, without regularization. It turns out that dynamic programming exploits the structure of finite-time problems much better, which is reflected in improved convergence bounds. The constants in the error bounds improve: the powers of the time horizon $H$ reduce from $5$ to $3$. Moreover, a model-dependent constant (which also appears in the convergence rate of the discounted setting and can be arbitrarily small) can be omitted for the step-wise policy gradient approach but not for the simultaneous approach.
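To make the step-wise idea concrete, here is a minimal sketch of training tabular softmax policies backwards in time with exact gradients, assuming known transitions `P[h]` and rewards `R[h]`; all names, shapes, step sizes, and iteration counts are illustrative assumptions, not the paper's exact algorithm or constants.

```python
# Sketch: step-wise (dynamic programming style) policy gradient for a
# finite-horizon tabular MDP with softmax parametrization and exact gradients.
import numpy as np

def softmax(theta):
    # Row-wise softmax over actions, numerically stabilized.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dynamical_policy_gradient(P, R, H, eta=0.1, iters=1000):
    """Train one softmax parameter table per epoch, backwards in time.

    P[h][s, a, s'] : transition probabilities at epoch h (assumed known)
    R[h][s, a]     : expected reward at epoch h
    Returns a list theta[0..H-1] of per-epoch parameters of shape (S, A).
    """
    S, A = R[0].shape
    theta = [np.zeros((S, A)) for _ in range(H)]
    V_next = np.zeros(S)  # value-to-go under the already-trained later epochs
    for h in reversed(range(H)):
        # Q-values at epoch h, with downstream policies held fixed.
        Q = R[h] + P[h] @ V_next            # shape (S, A)
        for _ in range(iters):
            pi = softmax(theta[h])          # current policy at epoch h
            V = (pi * Q).sum(axis=1)        # state values under pi_h
            # Exact softmax policy gradient (tabular, decouples over states).
            grad = pi * (Q - V[:, None])
            theta[h] += eta * grad          # gradient ascent step
        V_next = (softmax(theta[h]) * Q).sum(axis=1)
    return theta
```

In contrast, the simultaneous approach would update all `theta[0..H-1]` in every gradient step; the sketch above instead fixes the later epochs before moving one step earlier, mirroring the dynamic programming structure the abstract refers to.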
