

Poster

AgentBench: Evaluating LLMs as Agents

Xiao Liu · Hao Yu · Hanchen Zhang · Yifan Xu · Xuanyu Lei · Hanyu Lai · Yu Gu · Hangliang Ding · Kaiwen Men · Kejuan Yang · Shudan Zhang · Xiang Deng · Aohan Zeng · Zhengxiao Du · Chenhui Zhang · Sheng Shen · Tianjun Zhang · Yu Su · Huan Sun · Minlie Huang · Yuxiao Dong · Jie Tang

Halle B

Abstract:

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities in a multi-turn, open-ended generation setting. Our extensive test over 27 API-based and open-source (OSS) LLMs shows that, while top commercial LLMs demonstrate a strong ability to act as agents in complex environments, there is a significant performance gap between them and their OSS competitors. We identify the typical reasons for failure across environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Training on code and high-quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released.
