Poster
SWE-bench: Can Language Models Resolve Real-world Github Issues?
Carlos E Jimenez · John Yang · Alexander Wettig · Shunyu Yao · Kexin Pei · Ofir Press · Karthik Narasimhan
Halle B
Wed 8 May 6:45 a.m. PDT — 7:30 a.m. PDT
Language models (LMs) are improving rapidly, yet we lack benchmarks that are hard for them to solve but easy to evaluate. Coding is one such desirable task, but existing coding benchmarks feature only self-contained problems solvable within tens of lines of code. Inspired by how real-world programmers fix bugs and ship new features, we introduce SWE-bench, a benchmark of 2,294 GitHub issues sourced from 12 popular Python repositories. Given a codebase and an issue description, an LM is tasked with editing the codebase to resolve the issue and pass all related tests. Our experiments show that both state-of-the-art proprietary LMs and our fine-tuned LM, SWE-Llama, can resolve only the simplest issues. For example, Claude 2 and GPT-4 solve a mere 3.6% and 1.3% of tasks, respectively, even when provided with an oracle retriever. Through systematic analysis, we identify various factors underlying LM performance, such as the retrieval setup, codebase size, and issue complexity. We also identify key challenges LMs face in solving real-world software engineering problems, including understanding cross-file dependencies, localizing edit locations, and generating long, well-formatted patch files. SWE-bench shows that real-world software engineering is a diverse, challenging, and sustainable testbed for evaluating a wide range of language model abilities.
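To make the task setup concrete, the sketch below shows one way a single instance could be scored: apply the model-generated patch to the repository at the issue's base commit and run the issue's associated tests, counting the issue as resolved only if they pass. This is an illustrative sketch, not the paper's official evaluation harness; the helper name `resolved`, the local paths, and the reliance on the Hugging Face dataset fields (`base_commit`, `FAIL_TO_PASS`) are assumptions.

```python
# Minimal sketch of the SWE-bench task setup described in the abstract.
# Assumes git and pytest are installed and that `repo_dir` is a local clone
# of the instance's repository (paths and field names are illustrative).
import subprocess
from datasets import load_dataset

# Each instance pairs a GitHub issue with the repository snapshot it was filed against.
instance = load_dataset("princeton-nlp/SWE-bench", split="test")[0]

def resolved(repo_dir: str, base_commit: str, model_patch: str, tests: list[str]) -> bool:
    """Apply the LM-generated patch at the issue's base commit and run the
    issue's associated tests; the issue counts as resolved only if they pass."""
    subprocess.run(["git", "-C", repo_dir, "checkout", base_commit], check=True)
    subprocess.run(["git", "-C", repo_dir, "apply", model_patch], check=True)
    return subprocess.run(["pytest", *tests], cwd=repo_dir).returncode == 0

# Hypothetical usage (test IDs would come from the instance's FAIL_TO_PASS list):
# resolved("/tmp/repo", instance["base_commit"], "model.patch",
#          ["tests/test_example.py::test_fix"])
```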