

Fast Value Tracking for Deep Reinforcement Learning

Frank Shih · Faming Liang

Halle B


Reinforcement learning tackles sequential decision-making problems by designing an agent that interacts with an environment. However, existing algorithms often treat the problem as static, computing a point estimate of the model parameters that maximizes the agent's expected reward (the value function). They tend to overlook the stochastic nature of the agent-environment system and the importance of quantifying the uncertainty associated with the model parameters. Leveraging the Kalman filtering paradigm, we introduce a novel and scalable sampling algorithm, Langevinized Kalman Temporal-Difference (LKTD), for deep reinforcement learning. The algorithm, grounded in stochastic gradient Markov chain Monte Carlo (SGMCMC), efficiently draws samples from the posterior distribution of deep neural network parameters. Under mild conditions, we prove that the posterior samples generated by LKTD converge to a stationary distribution. This convergence not only enables us to quantify the uncertainty in the value function and model parameters, but also allows us to monitor that uncertainty during policy updates throughout training. The LKTD algorithm paves the way for more robust and adaptable reinforcement learning approaches.
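To make the idea of sampling value-function parameters instead of computing a point estimate more concrete, the following is a minimal sketch of plain stochastic gradient Langevin dynamics (SGLD), one member of the SGMCMC family, applied to a linear value function. It is not the LKTD algorithm and omits the Kalman filtering component entirely. The Gaussian likelihood on the TD error, the semi-gradient treatment of the bootstrapped target, the function name `sgld_value_samples`, all hyperparameters, and the two-state toy chain are assumptions chosen purely for illustration.

```python
import numpy as np

def sgld_value_samples(features, rewards, next_features, gamma=0.9,
                       step=1e-3, noise_var=1.0, prior_var=10.0,
                       n_iter=4000, batch=16, seed=0):
    """Draw approximate posterior samples of linear value weights with SGLD.

    Assumptions (illustrative, not from the paper): Gaussian likelihood on
    the TD error, a semi-gradient (the bootstrapped target is treated as
    fixed), and a zero-mean Gaussian prior on the weights.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    theta = np.zeros(d)
    samples = []
    for _ in range(n_iter):
        idx = rng.integers(0, n, size=batch)
        phi, phi_next, r = features[idx], next_features[idx], rewards[idx]
        td_error = r + gamma * (phi_next @ theta) - phi @ theta
        # Minibatch estimate of the log-posterior semi-gradient,
        # rescaled from the batch to the full data set.
        grad_loglik = (n / batch) * (phi.T @ td_error) / noise_var
        grad_logprior = -theta / prior_var
        grad = grad_loglik + grad_logprior
        # Langevin update: half a gradient step plus injected Gaussian noise.
        theta = theta + 0.5 * step * grad + np.sqrt(step) * rng.standard_normal(d)
        samples.append(theta.copy())
    return np.array(samples)

# Hypothetical toy problem: a deterministic two-state cycle 0 -> 1 -> 0,
# with reward 1 received when leaving state 1 and one-hot state features.
gamma = 0.9
states = np.arange(200) % 2
features = np.eye(2)[states]
next_features = np.eye(2)[1 - states]
rewards = states.astype(float)

samples = sgld_value_samples(features, rewards, next_features, gamma=gamma)
burn_in = len(samples) // 2
post_mean = samples[burn_in:].mean(axis=0)   # point summary of the value estimates
post_std = samples[burn_in:].std(axis=0)     # spread as a simple uncertainty measure

# Exact values of the toy chain, for comparison with the sample mean.
true_values = np.array([gamma, 1.0]) / (1.0 - gamma ** 2)
print(post_mean, post_std, true_values)
```

The spread of the retained samples gives a per-state uncertainty estimate alongside the mean, which is the kind of quantity a point estimator does not provide. In this sketch the step size trades off mixing speed against discretization error, and the prior variance mildly shrinks the estimates toward zero.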
