Summary
I recently came across JustHTML, a new Python library for parsing HTML released by Emil Stenström. It’s a very interesting piece of software, both as a useful library and as a case study in sophisticated AI-assisted programming.
[...]
[…] A few highlights:
He hooked in the 9,200-test html5lib-tests conformance suite almost from the start. There’s no better way to construct a new HTML5 parser than using the test suite that the browsers themselves use.
He picked the core API design himself (a TagHandler base class with handle_start() and similar methods) and told the model to implement that.
He added a comparative benchmark to track performance compared to existing libraries like html5lib, then experimented with a Rust optimization based on those initial numbers.
He threw the original code away and started from scratch as a rough port of Servo’s excellent html5ever Rust library.
He built a custom profiler and new benchmark and let Gemini 3 Pro loose on it, finally achieving micro-optimizations to beat the existing pure Python libraries.
He used coverage to identify and remove unnecessary code.
He had his agent build a custom fuzzer to generate vast numbers of invalid HTML documents and harden the parser against them.
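To make the fuzzing step concrete, here is a minimal sketch of what such a harness might look like. This is not Emil's fuzzer: the fragment list, the function names (random_document, fuzz, stdlib_parse) and the use of the standard library's html.parser as the parser under test are all my own assumptions, chosen so the example runs without any third-party code.

```python
import random
from html.parser import HTMLParser

# Hypothetical building blocks: a mix of valid and deliberately broken markup.
FRAGMENTS = [
    "<div>", "</div>", "<p", "</p>", "<table><td>", "</span>",
    "<!--", "-->", "<a href='x", '">', "&amp;", "&#xZZ;", "&",
    "<script>", "</script>", "<b><i></b></i>", "text", "\x00", "<", ">",
]


def random_document(rng, max_parts=20):
    """Join a random number of fragments into one (probably invalid) document."""
    count = rng.randint(1, max_parts)
    return "".join(rng.choice(FRAGMENTS) for _ in range(count))


def fuzz(parse, iterations=1000, seed=0):
    """Call parse() on random documents and collect every input that raises."""
    rng = random.Random(seed)  # fixed seed so any failure can be reproduced
    failures = []
    for _ in range(iterations):
        doc = random_document(rng)
        try:
            parse(doc)
        except Exception as exc:  # a lenient HTML parser should never raise
            failures.append((doc, exc))
    return failures


def stdlib_parse(doc):
    """Stand-in parser under test (an assumption, not JustHTML's API)."""
    parser = HTMLParser()
    parser.feed(doc)
    parser.close()


if __name__ == "__main__":
    failures = fuzz(stdlib_parse)
    print(f"{len(failures)} crashing inputs found")
```

Swapping stdlib_parse for a function that calls the parser you actually care about is the only change needed. Each entry in the returned list pairs the offending document with the exception it triggered, which makes it straightforward to turn a crash into a regression test.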