

Poster

Is Self-Repair a Silver Bullet for Code Generation?

Theo X. Olausson · Jeevana Priya Inala · Chenglong Wang · Jianfeng Gao · Armando Solar-Lezama

Halle B

Abstract:

Large language models have shown remarkable aptitude in code generation, but still struggle on challenging tasks. Self-repair, in which the model debugs and fixes mistakes in its own code, has recently become a popular way to boost performance in these settings. However, the literature contains only very limited studies of how and when self-repair works effectively, and one might wonder to what extent a model is really capable of repairing mistakes in code that was originally generated by that very same model. In this paper, we analyze the ability of Code Llama, GPT-3.5 and GPT-4 to perform self-repair on problems taken from HumanEval or APPS, finding that, when the cost of carrying out repair is taken into account, gains are often modest, vary considerably between subsets of the data, and are sometimes absent altogether. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; boosting the feedback with stronger models, we observe performance gains even in settings where the model does not benefit from self-repair. Furthermore, we observe that providing the model with feedback from human participants greatly benefits repair even for GPT-4, and we provide a brief qualitative analysis as to why.
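As an illustration of the self-repair loop described in the abstract, here is a minimal Python sketch. It is not the authors' implementation: the function names (`self_repair`, `generate`, `run_tests`, `give_feedback`, `repair`), the toy stand-ins for the model, and the use of a model-call count as a rough proxy for cost are all assumptions made for this example. Feedback and repair are kept as separate steps so that the feedback could, in principle, come from a stronger model or a human.

```python
from typing import Callable, Optional, Tuple


def self_repair(
    task: str,
    generate: Callable[[str], str],
    run_tests: Callable[[str], Optional[str]],
    give_feedback: Callable[[str, str, str], str],
    repair: Callable[[str, str, str, str], str],
    max_repairs: int = 1,
) -> Tuple[Optional[str], int]:
    """Return (passing program or None, number of model calls made).

    All callables are hypothetical placeholders for model or test-harness
    calls; run_tests returns None on success or an error message on failure.
    """
    program = generate(task)
    calls = 1  # assumption: cost is approximated by counting model calls
    error = run_tests(program)
    repairs = 0
    while error is not None and repairs < max_repairs:
        # Feedback stage: explain what is wrong with the failing program.
        feedback = give_feedback(task, program, error)
        # Repair stage: produce a new program conditioned on that feedback.
        program = repair(task, program, error, feedback)
        calls += 2
        repairs += 1
        error = run_tests(program)
    return (program if error is None else None), calls


# Toy stand-ins so the sketch runs without any model access.
def toy_generate(task: str) -> str:
    return "def add(a, b):\n    return a - b\n"  # deliberately buggy


def toy_run_tests(program: str) -> Optional[str]:
    namespace: dict = {}
    exec(program, namespace)
    got = namespace["add"](2, 3)
    return None if got == 5 else f"add(2, 3) returned {got}, expected 5"


def toy_give_feedback(task: str, program: str, error: str) -> str:
    return "The function subtracts instead of adding."


def toy_repair(task: str, program: str, error: str, feedback: str) -> str:
    return "def add(a, b):\n    return a + b\n"


fixed, model_calls = self_repair(
    "Write add(a, b).", toy_generate, toy_run_tests, toy_give_feedback, toy_repair
)
```

In this toy run the initial program fails, one feedback-and-repair round fixes it, and three model calls are made in total. Counting calls this way is one simple way to compare repair against spending the same budget on additional independent samples.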
