Announcing 1-bit Bonsai: the first commercially viable 1-bit LLMs

Summary

Today, we’re announcing PrismML, an AI lab building the most concentrated form of intelligence. Emerging from breakthrough research developed at Caltech, we’re guided by a core belief that the next major leaps in AI will be driven by order-of-magnitude improvements in intelligence density, not just sheer parameter count.

[...]

1-bit Bonsai 8B implements a proprietary 1-bit model design across the entire network: embeddings, attention layers, MLP layers, and the LM head are all 1-bit. There are no higher-precision escape hatches. It is a true 1-bit model, end to end, across 8.2 billion parameters.
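The post calls the design proprietary and does not describe how the 1-bit weights are encoded. For intuition only, here is a minimal sketch of one common, generic way to store and use 1-bit weights. Each group of weights keeps a single float scale (the mean absolute value) plus one sign bit per weight, packed eight per byte. This textbook scheme is an assumption, not PrismML's method, and every function name below is hypothetical.

```python
from typing import List, Tuple

# Generic 1-bit weight scheme for illustration only (an assumption, NOT
# Bonsai's proprietary design): one float scale per group plus one sign bit
# per weight. All names here are hypothetical.


def binarize_group(weights: List[float]) -> Tuple[float, List[int]]:
    """Approximate each weight w as scale * sign(w)."""
    # For a fixed sign pattern, the mean absolute value is the scale that
    # minimizes the squared reconstruction error.
    scale = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return scale, signs


def pack_signs(signs: List[int]) -> bytes:
    """Store +1 as bit 1 and -1 as bit 0, eight weights per byte (LSB first)."""
    out = bytearray((len(signs) + 7) // 8)
    for i, s in enumerate(signs):
        if s > 0:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def unpack_signs(packed: bytes, n: int) -> List[int]:
    """Recover the first n signs from packed bits."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(n)]


def binary_dot(x: List[float], scale: float, packed: bytes) -> float:
    """Dot product of activations x with one packed 1-bit weight group.

    With +/-1 weights the inner loop only adds or subtracts activations;
    a single multiply by the group scale happens at the end.
    """
    acc = 0.0
    for i, xi in enumerate(x):
        acc += xi if (packed[i // 8] >> (i % 8)) & 1 else -xi
    return scale * acc


if __name__ == "__main__":
    w = [0.5, -0.25, 0.75, -1.0]
    scale, signs = binarize_group(w)
    packed = pack_signs(signs)
    print(scale, signs, packed)                              # 0.625 [1, -1, 1, -1] b'\x05'
    print(binary_dot([1.0, 2.0, 3.0, 4.0], scale, packed))   # -1.25
```

The point of the sketch is the storage cost: the signs take exactly one bit per weight, and the shared scale adds only a small per-group overhead on top.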

Despite being 14x smaller than full-precision (16-bit) models in its 8B parameter-count class, it performs competitively on standard benchmarks while operating at radically higher efficiency.

[...]

1-bit Bonsai 8B is only 1.15 GB. At that size, it is small enough to fit on an iPhone 17 Pro. Relative to models with similar performance, that represents roughly a 14x reduction in model size. That reduction is not cosmetic. It changes what hardware can host serious intelligence.
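The 14x figure can be checked with simple arithmetic from the numbers in the post. This sketch assumes "GB" means 10^9 bytes and that the 16-bit baseline stores each of the 8.2 billion parameters in exactly 2 bytes with no other overhead.

```python
# Back-of-the-envelope check of the size figures quoted in the post.
# Assumptions: "GB" means 10**9 bytes, and the 16-bit baseline stores every
# parameter in 2 bytes with no extra overhead.

PARAMS = 8.2e9        # parameter count stated in the post
BONSAI_GB = 1.15      # Bonsai 8B size stated in the post


def fp16_size_gb(params: float) -> float:
    """Size in GB of a model storing each parameter as a 16-bit (2-byte) value."""
    return params * 2 / 1e9


def effective_bits_per_param(size_gb: float, params: float) -> float:
    """Average storage cost per parameter, in bits, implied by a file size."""
    return size_gb * 1e9 * 8 / params


if __name__ == "__main__":
    fp16_gb = fp16_size_gb(PARAMS)
    print(f"16-bit size:        {fp16_gb:.1f} GB")                                   # 16.4 GB
    print(f"size ratio:         {fp16_gb / BONSAI_GB:.1f}x")                         # 14.3x
    print(f"bits per parameter: {effective_bits_per_param(BONSAI_GB, PARAMS):.2f}")  # 1.12
```

The result, about 1.12 bits per parameter, sits slightly above a pure 1 bit. The post does not say what accounts for the difference; in quantized formats generally, per-group scale factors and file metadata are typical sources of such overhead.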

Across devices, Bonsai also delivers major throughput gains. On an M4 Pro Mac, it runs at 131 tokens per second. On an RTX 4090, it reaches 368 tokens per second. On an iPhone 17 Pro Max, it runs at roughly 44 tokens per second.
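Tokens per second translates directly into per-token latency (1000 divided by throughput, in milliseconds) and into how long a reply takes to stream. This sketch assumes the quoted figures are steady-state generation speeds, ignores prompt processing, and uses a hypothetical 500-token response.

```python
# Convert the decode throughputs quoted in the post into per-token latency
# and the time to stream a response. The 500-token length is a hypothetical
# example; prompt processing time is ignored.

THROUGHPUT_TOK_S = {
    "M4 Pro Mac": 131,
    "RTX 4090": 368,
    "iPhone 17 Pro Max": 44,
}


def ms_per_token(tok_per_s: float) -> float:
    """Average time between generated tokens, in milliseconds."""
    return 1000.0 / tok_per_s


def seconds_for(n_tokens: int, tok_per_s: float) -> float:
    """Time to generate n_tokens at a constant throughput."""
    return n_tokens / tok_per_s


if __name__ == "__main__":
    for device, tps in THROUGHPUT_TOK_S.items():
        print(f"{device:18s} {ms_per_token(tps):5.1f} ms/token, "
              f"{seconds_for(500, tps):4.1f} s per 500 tokens")
```

At these rates, each token takes roughly 7.6 ms on the M4 Pro, 2.7 ms on the RTX 4090, and 22.7 ms on the iPhone 17 Pro Max.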

[...]

1-bit Bonsai 8B uses substantially less energy than its 16-bit full-precision counterparts, delivering roughly 4-5x better energy efficiency. On the M4 Pro, it requires 0.074 mWh/tok, and on the iPhone 17 Pro Max, it requires only 0.068 mWh/tok.
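The per-token energy figures are easier to compare once converted into joules per token and tokens per watt-hour (1 mWh = 3.6 J). The sketch also applies the post's "4-5x" claim in reverse to show the rough 16-bit range it implies. That range is derived, not measured, and applying the same factor to both devices is an assumption.

```python
# Unit conversions for the per-token energy figures quoted in the post.
# 1 mWh = 1e-3 Wh * 3600 J/Wh = 3.6 J.

MWH_TO_JOULES = 3.6

ENERGY_MWH_PER_TOK = {
    "M4 Pro": 0.074,
    "iPhone 17 Pro Max": 0.068,
}


def joules_per_token(mwh_per_tok: float) -> float:
    return mwh_per_tok * MWH_TO_JOULES


def tokens_per_wh(mwh_per_tok: float) -> float:
    # 1 Wh = 1000 mWh
    return 1000.0 / mwh_per_tok


def implied_16bit_range(mwh_per_tok: float, low: float = 4.0, high: float = 5.0):
    """Rough 16-bit baseline implied by the '4-5x' claim (assumption, not a measurement)."""
    return mwh_per_tok * low, mwh_per_tok * high


if __name__ == "__main__":
    for device, e in ENERGY_MWH_PER_TOK.items():
        lo, hi = implied_16bit_range(e)
        print(f"{device:18s} {joules_per_token(e):.3f} J/token, "
              f"{tokens_per_wh(e):,.0f} tokens/Wh, "
              f"implied 16-bit: {lo:.3f}-{hi:.3f} mWh/token")
```

On the M4 Pro, 0.074 mWh/tok works out to about 0.27 J per token, or roughly 13,500 tokens per watt-hour.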