April 2026 Bluesky outage post-mortem
pckt.blog Apr 10, 2026

Summary

Hey all! I'm Jim, and I do system-y things at Bluesky. I'm here to give you some details about what happened on Monday of this week, when Bluesky went down intermittently for about half of our users over roughly 8 hours.

[...]

This particular RPC (GetPostRecord) takes a batch of post URIs and looks them all up in memcached, falling back to Scylla on a cache miss. What I had missed is that we deployed a new internal service last week that sent fewer than three GetPostRecord requests per second, but sometimes sent batches of 15-20 thousand URIs at a time. Typically, we'd probably be doing between 1 and 50 post lookups per request.

Every RPC handler in the data plane does bounded concurrency (i.e. errgroup.SetLimit). However, this endpoint did not! It was the only endpoint in the entire system that was missing it.
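To make "bounded concurrency" concrete, here is a minimal sketch of the pattern, not our actual handler. errgroup.SetLimit lives in golang.org/x/sync, so to keep the example stdlib-only a buffered channel stands in for it as a counting semaphore. The names lookupAll and lookup, the limit of 50, and the example URIs are all made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lookupAll calls lookup once per URI with at most limit calls in flight.
// It returns the results in input order plus the peak concurrency observed.
// (Hypothetical helper: a stdlib stand-in for errgroup's SetLimit.)
func lookupAll(uris []string, limit int, lookup func(string) string) ([]string, int64) {
	results := make([]string, len(uris))
	sem := make(chan struct{}, limit) // counting semaphore with `limit` slots
	var wg sync.WaitGroup
	var inFlight, peak atomic.Int64

	for i, uri := range uris {
		sem <- struct{}{} // blocks while `limit` lookups are already running
		wg.Add(1)
		go func(i int, uri string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this lookup finishes

			// Track the highest number of concurrent lookups seen.
			n := inFlight.Add(1)
			for {
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}

			results[i] = lookup(uri)
			inFlight.Add(-1)
		}(i, uri)
	}
	wg.Wait()
	return results, peak.Load()
}

func main() {
	// A batch the size of the ones that hurt us; the URIs are placeholders.
	uris := make([]string, 20000)
	for i := range uris {
		uris[i] = fmt.Sprintf("at://example/post/%d", i)
	}
	fakeLookup := func(uri string) string { return "record for " + uri }

	results, peak := lookupAll(uris, 50, fakeLookup)
	fmt.Printf("looked up %d URIs, peak concurrency %d (limit 50)\n", len(results), peak)
}
```

However large the batch is, the number of simultaneous lookups (and so the number of connections the handler can ask for at once) never exceeds the limit. Without the semaphore, the loop would start one goroutine per URI immediately.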

That means that we'd launch 15-20 thousand goroutines for the request and slam the daylights out of memcached by dialing a ton of connections. Since our max idle conn pool size was 1000, most of those connections couldn't go back into the pool when they were done, so we closed them and handed the sockets back to the OS. The closed sockets would build up in the TCP TIME_WAIT state and exhaust all available ports.
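A rough back-of-the-envelope version of that arithmetic, as a sketch under stated assumptions rather than measured numbers: it assumes every goroutine in a burst holds its own connection at the same time, that all connections go to a single memcached address, and that the host uses the Linux default ephemeral port range of 32768-60999. None of those are confirmed values from our setup, and the function names are made up for illustration.

```go
package main

import "fmt"

// closedPerBurst estimates how many connections get closed after a burst:
// anything beyond the idle pool cap cannot be kept and is closed instead.
// (Upper bound: assumes the whole burst holds connections simultaneously.)
func closedPerBurst(burst, maxIdle int) int {
	if burst <= maxIdle {
		return 0
	}
	return burst - maxIdle
}

// burstsToExhaust returns how many bursts, landing inside one TIME_WAIT
// window, would use up `ports` local ports. It returns -1 if a burst
// closes no connections, since the ports then never run out this way.
func burstsToExhaust(ports, closed int) int {
	if closed <= 0 {
		return -1
	}
	return (ports + closed - 1) / closed // ceiling division
}

func main() {
	const (
		burst          = 20000             // URIs in one large batch (from the post)
		maxIdle        = 1000              // max idle conn pool size (from the post)
		ephemeralPorts = 60999 - 32768 + 1 // ASSUMPTION: Linux default port range
	)

	closed := closedPerBurst(burst, maxIdle)
	fmt.Printf("sockets left in TIME_WAIT per burst: up to %d\n", closed)
	fmt.Printf("bursts needed to exhaust %d ports: %d\n",
		ephemeralPorts, burstsToExhaust(ephemeralPorts, closed))
}
```

Under these assumptions, one 20,000-URI batch leaves up to 19,000 sockets in TIME_WAIT, so two such batches landing inside the same TIME_WAIT window would already exceed the assumed 28,232 ports. A normal request of 1 to 50 lookups stays under the idle pool cap and closes nothing.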