Skip to yearly menu bar Skip to main content


Poster

Skill-Mix: a Flexible and Expandable Family of Evaluations for AI Models

Dingli Yu · Simran Kaur · Arushi Gupta · Jonah Brown-Cohen · Anirudh Goyal · Sanjeev Arora

Halle B
[ ]
Fri 10 May 7:30 a.m. PDT — 9:30 a.m. PDT

Abstract: As the role of LLMs shifts from statistical modeling of language to serving as general-purpose AI agents, how should LLM evaluations change? Arguably, a key ability of an AI agent is to flexibly combine, as needed, the basic skills it has learned. This capability to combine skills plays an important role in (human) pedagogy and also in a recent paper on emergence phenomena (Arora & Goyal,2023). Our paper introduces an evaluation, Skill-Mix, to measure this capability. Using a list of $N$ skills the evaluator repeatedly picks random subsets of $k$ skills and asks the LLM to produce text combining that subset of skills. Since the number of subsets grows like $N^k$, for even modest $k$ this evaluation will, with high probability, require the LLM to produce text it has not seen in the training set. The paper develops a methodology for (a) designing and administering such an evaluation, and (b) automatic grading (plus spot-checking by humans) of the results using the open LLaMA-2 70b model as well as GPT-4. Administering a version of Skill-Mix to popular chatbots gave results that, while generally in line with prior expectations, contained surprises. We found sizeable differences in capabilities among models ---including suspected cases of ``cramming for the leaderboard''--- that had not been revealed by the (much simpler) evaluations used in popular LLM leaderboards. Our methodology can flexibly change to future models and model capabilities, by expanding the set of skills being tested and increasing $k$. We hope Skill-Mix (which will be publicly released, including all prompts and code) may grow into an eco-system of open evaluations for AI capabilities, including in multi-modal settings.

Chat is not available.