In-Person Oral presentation / top 25% paper
Multi-lingual Evaluation of Code Generation Models
Ben Athiwaratkun · Sanjay Krishna Gouda · Zijian Wang · Xiaopeng Li · Yuchen Tian · Ming Tan · Wasi Ahmad · Shiqi Wang · Qing Sun · Mingyue Shang · Sujan Kumar Gonugondla · Hantian Ding · Varun Kumar · Nathan Fulton · Arash Farahani · Siddhartha Jain · Robert Giaquinto · Haifeng Qian · Murali Krishna Ramanathan · Ramesh Nallapati · Baishakhi Ray · Parminder Bhatia · Sudipta Sengupta · Dan Roth · Bing Xiang
AD12
We present two new benchmarks, MBXP and Multilingual HumanEval, designed to evaluate code completion models in over 10 programming languages. These datasets are generated using a conversion framework that transpiles prompts and test cases from the original MBPP and HumanEval datasets into the corresponding data in the target language. Using these benchmarks, we assess the performance of code generation models in a multi-lingual fashion and observe the generalization ability of language models on out-of-domain languages, the advantages of multi-lingual models over mono-lingual ones, the ability of few-shot prompting to teach the model new languages, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages, which can be used for other code-related evaluations such as code insertion, robustness, or summarization tasks.
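To make the idea of transpiling prompts and test cases concrete, the following is a minimal sketch of how a Python-style function stub and an MBPP-style `assert` statement might be converted into JavaScript. It is an illustration only and not the authors' framework: the helper names (`snake_to_camel`, `to_js_literal`, `convert_prompt`, `convert_assert`), the choice of JavaScript as the target, and the restriction to tests of the form `assert f(literals...) == literal` are all assumptions made for this example.

```python
import ast
import json


def snake_to_camel(name: str) -> str:
    """Hypothetical helper: map a Python snake_case name to camelCase."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def to_js_literal(node: ast.expr) -> str:
    """Render a Python literal AST node as a JavaScript literal.

    Assumption: only JSON-compatible literals (numbers, strings, booleans,
    None, lists, tuples, string-keyed dicts) appear in the test cases.
    """
    return json.dumps(ast.literal_eval(node))


def convert_prompt(func_name: str, params: list[str], description: str) -> str:
    """Build a JavaScript function stub with the description as a comment."""
    signature = f"function {snake_to_camel(func_name)}({', '.join(params)}) {{"
    return f"/**\n * {description}\n */\n{signature}\n"


def convert_assert(stmt: str) -> str:
    """Convert `assert f(args) == expected` into a Node.js assertion."""
    node = ast.parse(stmt).body[0]
    ok = (
        isinstance(node, ast.Assert)
        and isinstance(node.test, ast.Compare)
        and len(node.test.ops) == 1
        and isinstance(node.test.ops[0], ast.Eq)
        and isinstance(node.test.left, ast.Call)
        and isinstance(node.test.left.func, ast.Name)
    )
    if not ok:
        raise ValueError(f"unsupported test case shape: {stmt!r}")
    call = node.test.left
    args = ", ".join(to_js_literal(arg) for arg in call.args)
    expected = to_js_literal(node.test.comparators[0])
    name = snake_to_camel(call.func.id)
    return f"assert.deepStrictEqual({name}({args}), {expected});"


if __name__ == "__main__":
    print(convert_prompt("add_numbers", ["a", "b"], "Return the sum of a and b."))
    print(convert_assert("assert add_numbers(1, 2) == 3"))
```

Parsing the test with `ast` rather than with regular expressions keeps the argument and expected-value literals structured, so each one can be re-rendered in the target language's syntax. A real conversion framework would also need per-language type handling and naming conventions, which this sketch does not attempt.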