Assessing the Influence of Language Models on the Productivity of Seasoned Software Developers
A recent study conducted by METR, a company that evaluates the risks and benefits of AI systems, has cast doubt on the usefulness of LLM-based coding tools, such as Cursor Pro with Claude 3.5/3.7 Sonnet, as coding partners for experienced developers. The study, which involved 16 experienced open source software developers and 246 realistic coding tasks, aimed to objectively measure the effect of LLM-based tools on software development productivity and to establish a methodology for assessing their impact.
The study placed a significant emphasis on creating realistic scenarios, rather than using canned benchmarks, with tasks involving adding features to code, fixing bugs, and refactoring, similar to tasks in open source projects. Despite METR's suggestion that performance may improve over time, the current results raise questions about the tools' usefulness.
The key finding was that productivity decreased by about 19% when developers used these LLM-based tools compared to working without assistance. This ran contrary to the developers' own expectations: they had estimated a 20-24% speedup, yet the actual measurements showed a slowdown. The slowdown was attributed to factors such as over-optimism about LLM capabilities, interference with developers' existing knowledge, poor LLM performance on large codebases, unreliability of generated code, and the inability of LLMs to make effective use of tacit knowledge and context.
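The gap between the forecast speedup and the measured slowdown comes down to comparing task completion times with and without AI assistance. A minimal sketch of that arithmetic, using entirely invented numbers (the study's per-task times are not given here), might look like this:

```python
# Hypothetical illustration of how a slowdown figure like the study's ~19%
# can be derived from paired task completion times. All numbers below are
# invented for illustration; they are not data from the METR study.

def relative_slowdown(baseline_hours: float, ai_hours: float) -> float:
    """Fractional change in completion time relative to the unaided baseline.

    Positive values mean the task took longer with AI assistance.
    """
    return (ai_hours - baseline_hours) / baseline_hours

# Invented example: a task taking 2.0h unaided and 2.38h with AI assistance
# corresponds to a 19% slowdown.
print(f"{relative_slowdown(2.0, 2.38):.0%}")
```

Note that a 19% increase in completion time is not the inverse of a 19% speedup; the sign convention matters when comparing measured slowdowns against forecast speedups.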
However, the results also hint at a steep learning curve: a minority of participants, typically those with prior experience using Cursor, did see improved performance, suggesting that using these tools effectively takes time to learn.
In summary, despite their promise as coding assistants, LLM-based tools like Cursor Pro with Claude 3.5/3.7 Sonnet currently impede rather than enhance developer productivity on real-world, complex software development tasks. The study underscores the need to reevaluate the utility of such tools until their capabilities, and their integration into developer workflows, improve significantly.
The full findings, authored by Joel Becker et al., are available as a PDF. It is worth noting that this summary does not detail the exact nature of the tasks given to the developers, nor how the no-assistance condition in the RCT was constructed.
The takeaway is that open source developers may see a productivity decrease of roughly 19% when using LLM-based tools such as Cursor Pro with Claude, and that realizing any potential benefit from integrating AI-based tools into developer workflows may require climbing a significant learning curve first.