OpenAI’s ChatGPT chatbot can fix software bugs well, but its key advantage over other methods and AI models is its unique ability to hold a dialogue with humans, which allows it to improve the correctness of its answers.
Researchers from Johannes Gutenberg University Mainz and University College London pitted OpenAI’s ChatGPT against “standard automated program repair techniques” and two deep learning approaches to program repair: CoCoNut, from researchers at the University of Waterloo, Canada; and Codex, OpenAI’s GPT-3-based model that underpins GitHub’s Copilot pair-programming code completion service.
“We find that ChatGPT’s bug fixing performance is competitive to the common deep learning approaches CoCoNut and Codex and notably better than the results reported for the standard program repair approaches,” the researchers write in a new arXiv paper, first spotted by New Scientist.
That ChatGPT can solve coding problems isn’t new, but the researchers highlight that its unique capacity for dialogue with humans gives it a potential edge over other approaches and models.
The researchers tested ChatGPT’s performance using the QuixBugs bug fixing benchmark. The automated program repair (APR) systems appear to be at a disadvantage as they were developed prior to 2018.
ChatGPT is based on the transformer architecture, which Meta’s AI chief Yann LeCun highlighted this week was developed by Google. Codex, CodeBERT from Microsoft Research, and its predecessor BERT from Google are all based on Google’s transformer method.
OpenAI highlights ChatGPT’s dialogue capability in examples for debugging code, where it can ask for clarifications and receive hints from a person to arrive at a better answer. It trained the large language models behind ChatGPT (GPT-3 and GPT-3.5) using Reinforcement Learning from Human Feedback (RLHF).
While ChatGPT’s capacity for discussion can help it arrive at a more correct answer, the quality of its suggestions remains unclear, the researchers note. That’s why they wanted to evaluate ChatGPT’s bug-fixing performance.
The researchers tested ChatGPT against QuixBugs’ 40 Python problems, and then manually checked whether the suggested solution was correct or not. They repeated each query four times because there is some randomness in ChatGPT’s answers, as a Wharton professor found out after putting the chatbot through an MBA-like exam.
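QuixBugs problems are small classic algorithms, each containing a single-line defect, paired with a corrected reference version. The sketch below illustrates what such a problem looks like; it is written in the style of the benchmark’s Python problems, not quoted verbatim from it:

```python
# Illustrative bug-fixing task in the style of the QuixBugs benchmark:
# a small function with a one-line defect, plus the intended fix.

def bitcount_buggy(n):
    """Count set bits -- defective version."""
    count = 0
    while n:
        n ^= n - 1  # bug: XOR does not clear the lowest set bit,
                    # so this can loop forever (e.g. n = 1)
        count += 1
    return count

def bitcount_fixed(n):
    """Count set bits using Kernighan's trick."""
    count = 0
    while n:
        n &= n - 1  # each iteration clears the lowest set bit
        count += 1
    return count
```

A repair system (or ChatGPT) is shown only the buggy function and judged on whether its proposed patch restores the intended behavior.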
ChatGPT solved 19 of the 40 Python bugs, putting it on par with CoCoNut (19) and Codex (21). But standard APR methods only solved seven of the issues.
The researchers found that ChatGPT’s success rate with follow-up interactions reached 77.5%.
The implications for developers in terms of effort and productivity are ambiguous, though. Stack Overflow recently banned ChatGPT-generated answers because they were low-quality but plausible-sounding. The Wharton professor found that ChatGPT could be a great companion to MBA students, as it can play a “smart consultant” — one who produces elegant but oftentimes wrong answers — and foster critical thinking.
“This shows that human input can be of much help to an automated APR system, with ChatGPT providing the means to do so,” the researchers write.
“Despite its great performance, the question arises whether the mental cost required to verify ChatGPT answers outweighs the advantages that ChatGPT brings.”