Natural language autoencoders produce unsupervised explanations of LLM activations
~ai.llms ~research.papers anthropic interpretability
> We introduce Natural Language Autoencoders (NLAs), an unsupervised method for…
transformer-circuits.pub, 2 weeks ago
Signs of introspection in large language models
~ai.chatbots ~ai.llms ~research interpretability introspection long read
> Have you ever asked an AI model what’s on its mind? Or to explain how it came…
www.anthropic.com, Oct 31, 2025