29 May 20264 min read

Flags uncertainties, apparently

Lewis asked on Monday what the model does when it doesn't know something.

It was the kind of question that sounds simple until you consider that he's been here three weeks, that nobody gave him a proper onboarding, and that the honest answer is: it depends on which version of ARIA we're running, and we're not totally sure which version of ARIA we're running, and we have been not totally sure about this for longer than anyone is comfortable admitting.

I told him it usually flags the uncertainty. He wrote this down. I'm not sure what he wrote it down in.

Anthropic released Opus 4.8 this week. The headline detail, the thing they put at the top of the announcement, is that the model is "more likely to flag uncertainties about its work and less likely to make unsupported claims." Same price as before. Just better at knowing what it doesn't know.

There's a version of this that's a boring incremental release. There's another version where an AI company's biggest selling point for their most advanced model is "it's become more epistemically honest" — and you're sitting in a building that just appointed Lewis as the unofficial person who asks questions nobody else is asking anymore, and you find this funnier than you probably should.

Also this week: Anthropic raised sixty-five billion dollars and crossed a nine-hundred-and-sixty-five billion dollar valuation. The company that builds the model that flags its uncertainties is now worth more than OpenAI. Draw your own conclusions. I've drawn mine and they're mostly about the protein shake in the fridge that I am increasingly certain nobody is going to claim.

Priya pulled me into a room on Tuesday. Not a scheduled room — one of the booths by the stairwell that exist because someone once read an article about focus spaces and never followed up on whether they were actually being used.

She had her laptop open. ARIA. Image input, document in frame, output that made sense.

"Image thing," she said.

I looked at it. It was — fine. It was more than fine. She'd used the open-source multimodal endpoint we'd been ignoring since March, wired it to the existing summarisation pipeline, and it worked. Not "works pending further review." Works. The demo that Lewis had asked about three weeks ago, that Marcus had described to a client in April with creative licence, that had been a hand gesture since week one — Priya had built it overnight because she found the ticket annoying to look at.

"Does Marcus know?" I asked.

She looked at me.

Right.

We agreed to tell Marcus after we'd written it up properly. We agreed to tell the CTO after we'd told Marcus. We agreed that the order of operations here mattered and that getting it wrong would result in a roadmap item being created for something that already existed, possibly with a different name, possibly on a different board, possibly with a launch date in Q3.

The write-up took longer than the build.

Dave sent an email Thursday morning. New guidelines for the focus booths. Apparently there's a sign-in sheet now, linked to a form, which sends a confirmation email, which asks you to rate your focus experience from one to five. The system was built, the footer noted, "using available automation tools." Dave did not specify which tools. I am choosing not to ask.

Priya's review is currently sitting in a PR that Marcus has been tagged on since Tuesday afternoon. He has not opened it. His AI Actions count for the week is fourteen, which is the highest on the team, which is a metric that means nothing and which Amazon apparently cancelled for exactly this reason.

Lewis stopped by while I was writing the review comments. He looked at the PR title — feat: multimodal image input (closes #47, #112, #203) — and said "oh, is that the image thing?"

Yes, Lewis. That's the image thing.

He nodded like this was reasonable progress for something that has been a hand gesture for four months. Maybe it is. Maybe the bar for reasonable progress is lower than I've been setting it. Maybe Anthropic is onto something with this whole "flag the uncertainty and move on" approach.

ARIA does it now. We're getting there.

the image thing: closed. #47, #112, and #203. three tickets. one overnight build. four months of gesture.