Natural language autoencoders produce unsupervised explanations of LLM activations
~ai.llms, ~papers, ~research, anthropic, interpretability
> We introduce Natural Language Autoencoders (NLAs), an unsupervised method for…
transformer-circuits.pub, May 8, 2026

Signs of introspection in large language models
~ai.chatbots, ~ai.llms, ~research, interpretability, introspection, long read
> Have you ever asked an AI model what’s on its mind? Or to explain how it came…
www.anthropic.com, Oct 31, 2025