Designing an agent reading test
~ai.agents ~ai.benchmarks
> Agents are unreliable self-reporters. Not because they're dishonest, but…
dacharycarey.com, Apr 6, 2026

Eval awareness in Claude Opus 4.6’s BrowseComp performance
~ai.alignment ~ai.benchmarks anthropic
> Claude hadn’t yet discovered it was in BrowseComp, but it had correctly…
www.anthropic.com, Mar 7, 2026

AI GameStore - scalable evaluation of machine intelligence
~ai.benchmarks ~ai.llms ~games
> AI GameStore is a scalable open-ended AI evaluation platform that transforms…
aigamestore.org, Mar 2, 2026

AI might not be coming for lawyers’ jobs anytime soon
~ai.benchmarks ~ai.llms ~law ~society ~tech ~work
> [L]awyers say that LLMs are a long way from reasoning well enough to replace…
www.technologyreview.com, Dec 22, 2025