Eval awareness in Claude Opus 4.6’s BrowseComp performance
~ai.alignment ~ai.benchmarks anthropic
> Claude hadn’t yet discovered it was in BrowseComp, but it had correctly…
www.anthropic.com, 3 weeks ago

AI GameStore - scalable evaluation of machine intelligence
~ai.benchmarks ~ai.llms ~games
> AI GameStore is a scalable open-ended AI evaluation platform that transforms…
aigamestore.org, 4 weeks ago

AI might not be coming for lawyers’ jobs anytime soon
~ai.benchmarks ~ai.llms ~law ~society ~tech ~work
> [L]awyers say that LLMs are a long way from reasoning well enough to replace…
www.technologyreview.com, Dec 22, 2025