~ai.benchmarks - Search

4 links

Designing an agent reading test

~ai.agents ~ai.benchmarks

> Agents are unreliable self-reporters. Not because they're dishonest, but…

dacharycarey.com Apr 6, 2026 Tildes

Eval awareness in Claude Opus 4.6’s BrowseComp performance

~ai.alignment ~ai.benchmarks anthropic

> Claude hadn’t yet discovered it was in BrowseComp, but it had correctly…

www.anthropic.com Mar 7, 2026 Tildes

AI GameStore - scalable evaluation of machine intelligence

~ai.benchmarks ~ai.llms ~games

> AI GameStore is a scalable open-ended AI evaluation platform that transforms…

aigamestore.org Mar 2, 2026

AI might not be coming for lawyers’ jobs anytime soon

~ai.benchmarks ~ai.llms ~law ~society ~tech ~work

> [L]awyers say that LLMs are a long way from reasoning well enough to replace…

www.technologyreview.com Dec 22, 2025 Tildes