Last Friday, OpenAI unveiled Codex, a new coding system designed to carry out complex programming tasks from natural language commands. With Codex, OpenAI joins a small but growing group of companies building tools that push software development toward greater automation.
Traditional AI coding assistants, such as GitHub's Copilot, function primarily as advanced autocomplete inside integrated development environments, requiring users to work with the generated code directly. A new wave of agentic coding tools, including Devin, SWE-Agent, and OpenHands, aims to change this approach by removing the need for users to touch the code at all. These tools function more like project managers: users assign tasks through platforms such as Asana or Slack and return to find solutions upon completion.
Prominent voices in AI see this shift as a natural evolution towards greater automation in programming. Kilian Lieret, a Princeton researcher and member of the SWE-Agent team, traces the progression from manual coding to autocomplete systems and now to autonomous agents that tackle coding problems independently. The ambition is to delegate tasks entirely to these agents, which would resolve issues such as bug reports without the user's direct involvement.
Despite the promise, transitioning to fully autonomous systems presents challenges, as evidenced by recent critiques of Devin, which faced harsh evaluations following its general availability. Users reported that overseeing these systems could become as cumbersome as manual coding due to numerous errors. Nonetheless, the financial backing for such tools—evidenced by Cognition AI’s sizeable funding—suggests confidence in their evolution.
While advocates praise the potential of these coding agents, they caution against complete reliance without human oversight. Experts like Robert Brennan, CEO of All Hands AI, highlight the necessity for human intervention to review code quality and prevent chaotic outcomes arising from unchecked auto-approvals.
An ongoing issue is the phenomenon of "hallucination," where tools confidently generate incorrect or fabricated information. All Hands AI is actively developing measures to mitigate these risks, but comprehensive solutions remain elusive.
To gauge progress in agentic programming, benchmarks like SWE-Bench allow developers to evaluate their models against unresolved challenges from open-source repositories. Currently, OpenHands leads with a 65.8% problem-solving success rate, while OpenAI claims its model, codex-1, achieves 72.1%—a figure yet to be independently verified. Concerns linger, however, regarding the practicality of these scores translating into meaningful, independent coding capabilities, especially for complex systems needing multi-stage solutions.
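For context on what those percentages mean: a SWE-Bench-style score is simply the fraction of benchmark issues whose generated patch makes the repository's tests pass. A minimal sketch of that calculation, using hypothetical per-issue result data (the issue names and record format here are illustrative, not the benchmark's actual output schema), might look like this:

```python
# Hypothetical per-issue results from an agent run on a SWE-Bench-style suite.
# Each record notes whether the agent's patch resolved the issue
# (i.e., the repository's test suite passed after applying the patch).
results = [
    {"issue": "repo-a#101", "resolved": True},
    {"issue": "repo-a#117", "resolved": False},
    {"issue": "repo-b#42", "resolved": True},
    {"issue": "repo-c#7", "resolved": True},
]

def resolve_rate(results):
    """Return the percentage of issues the agent fully resolved."""
    if not results:
        return 0.0
    return 100.0 * sum(r["resolved"] for r in results) / len(results)

print(f"Resolve rate: {resolve_rate(results):.1f}%")  # 75.0% on this toy data
```

A reported figure like 65.8% or 72.1% is this ratio computed over the benchmark's full issue set, which is why independent verification matters: the score depends entirely on which issues are run and how "resolved" is judged.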
The community hopes that advancements in foundational models will enhance the reliability of agentic coding tools, ultimately easing the burden on developers. Crucial to this evolution will be addressing reliability issues such as hallucinations and figuring out the level of trust that can be safely delegated to these agents. As Brennan notes, the challenge remains: how much can we entrust these technologies without undermining the quality of our work?