Under one percent, apparently


There is a benchmark called ARC-AGI-3. It launched this week with a two million dollar prize, and every frontier model — every model that has been described to you in headlines using words like "transformative," "revolutionary," and "approaching human-level" — scored under one percent.

Not one percent. Under one percent.

I want to sit with that for a moment, because I think it deserves one.


Monday. The CTO walked in holding his phone the way people hold their phone when they've read something on the train and have not fully metabolised it. He had been reading about Google's new thing — Antigravity, which is an agentic IDE and also, apparently, a Cursor competitor, and also, apparently, free — and he had some thoughts.

The thoughts arrived in the usual format. He stood at the end of the room and made a shape with his hands. The shape was not quite a circle. It was more of a spiralling motion, like he was describing water going down a drain, which felt directionally correct if you were trying to describe how product roadmaps work around here.

"Google's doing the whole thing now," he said. "All in one. Agent, model, terminal, the works."

Priya was looking at her screen. She had been looking at her screen throughout. I know, because I was watching her in the way you watch someone in a meeting when you need confirmation that you are both experiencing the same reality. Her expression did not change. It never changes. It is the expression of someone who has pre-processed the situation and arrived at a position and is now simply waiting for the rest of the room to catch up.

"So we should look at it," the CTO said.

Nobody disagreed, because nobody ever disagrees in the preliminary stages. Disagreement comes later, usually after something has been committed to main.

He said we were "well positioned" to take advantage of this. I wrote that down. I always write that down. I have a document. It is called positioned.md and it has twenty-three entries.


The thing about Google launching Gemini 3 and a new IDE and a CLI all on the same day is that it is extremely impressive and also, if you have been paying for Cursor, slightly annoying in the specific way of finding out the bus is now free the day after you renew your annual pass. Antigravity is free. Gemini 3 Pro access is free. The tools we have been carefully evaluating for the past two months, and which Marcus described in week three as a "differentiator," are now available to everyone at no cost.

Marcus, to his credit, saw the email about Antigravity and wrote GAME CHANGER in the shared channel within six minutes. There was no context. No follow-up. Just GAME CHANGER, standing alone, fully capitalised, a complete statement of position.

I didn't respond. Priya didn't respond. The message sits there now, unanswered, which I think is the correct outcome.


But back to ARC-AGI-3. Under one percent.

The same models. The ones being built into IDEs and sold as agentic coding platforms and described on earnings calls as "approaching human-level reasoning." Under one percent. The researchers who designed the benchmark have been careful to say this doesn't mean the models are useless. It means they are very good at the things they were trained to be very good at, and this particular test measures something else.

That seems fair. That seems accurate.

It is also the kind of thing that is useful to think about when your CTO makes a spiralling motion with his hand and says the company is "well positioned" and the product manager types GAME CHANGER into a chat room and your colleagues receive a LinkedIn notification from Karen that she is "thrilled to share" the company's "next-generation agentic AI strategy."

Nobody told Karen about the strategy. She saw the Antigravity announcement and extrapolated.

She's added blockchain again. I checked. It's in the third slide.

Under one percent. I find this oddly comforting. The models are not going to take over the world. They're going to help write boilerplate and summarise documents and generate the kind of sentence that sounds like a sentence but isn't quite, and they're going to score under one percent on the test that actually matters.

We are all, in different ways, scoring under one percent on the test that actually matters.

The image thing is still in the Q2 backlog. It will be Q3 by the time anyone looks at it. Priya opened a ticket. That's more than most things get.

positioned.md: twenty-three entries and counting.