category

coding

1 story · latest 5 Mar

Editors, agents and review tools — what actually ships code, not just demos.

01

GPT-5.4 Thinking beats the human baseline at using a computer

OpenAI's reasoning model scored a record 75.0% on the OSWorld-Verified computer-use benchmark — past the reported 72.4% human baseline.

OpenAI
5 Mar
in our stack: Cursor