coding

GPT-5.4 Thinking beats the human baseline at using a computer

A model that navigates real software better than the human baseline is the milestone autonomous coding agents have been waiting for.

By the desk · 22 June 2026 · 4 min read · OpenAI ↗

OpenAI released GPT-5.4 Thinking on 5 March. The headline figure comes from OSWorld-Verified, a benchmark that scores how well a model can actually drive software: clicking, typing, and navigating real applications. GPT-5.4 Thinking scored 75.0%, 2.6 percentage points ahead of a reported human baseline of 72.4% and a large jump over the prior generation.

Computer-use scores are the number to watch for anyone betting on agents that do work rather than describe it. A model that can operate the same tools you do, and scores above the benchmark's human baseline while doing it, is the difference between a demo and a co-worker.

The usual caution applies: a benchmark is a controlled environment, not your messy desktop. But the direction is unambiguous, and it is the reason coding agents stayed on our filter this quarter.

tools mentioned

GPT-5.4 Thinking — OpenAI · 75.0% on OSWorld-Verified