GPT-5.4 Thinking beats the human baseline at using a computer
A model that navigates real software better than the human baseline is the milestone autonomous coding agents have been waiting for.
OpenAI released GPT-5.4 Thinking on 5 March. The headline figure is on OSWorld-Verified, a benchmark that scores how well a model can actually drive software: clicking, typing, navigating real applications. GPT-5.4 Thinking scored 75.0%, 2.6 percentage points ahead of a reported human baseline of 72.4% and a large jump over the prior generation.
Computer-use scores are the number to watch for anyone betting on agents that do work rather than describe it. A model that can operate the same tools you do, more reliably than the benchmark's human baseline, marks the difference between a demo and a co-worker.
The usual caution applies: a benchmark is a controlled environment, not your messy desktop. But the direction is unambiguous, and it is the reason coding agents stayed on our filter this quarter.