> While Mixture-of-Experts (MoE) scales capacity via conditional computation, …
> A from-scratch PyTorch implementation of TurboQuant (ICLR 2026), Google's …
> We've achieved 10x data efficiency with NanoGPT Slowrun within a few weeks. …
> This is a brief guide to my new art project microgpt, a single file of 200 …
> A short introduction to RLHF and post-training focused on language models. more