interpretability - Search

2 links

Natural language autoencoders produce unsupervised explanations of LLM activations

~ai.llms ~papers ~research anthropic interpretability

> We introduce Natural Language Autoencoders (NLAs), an unsupervised method for…

transformer-circuits.pub May 8, 2026

Signs of introspection in large language models

~ai.chatbots ~ai.llms ~research interpretability introspection long read

> Have you ever asked an AI model what’s on its mind? Or to explain how it came…

www.anthropic.com Oct 31, 2025 Tildes