Degraded Performance Across Fin and Inbox Services

Write-up

Summary

On June 1, 2026, a fault in our underlying AWS infrastructure caused a deterioration in our US database, affecting less than 2% of customers. The database became slow and eventually unresponsive for those customers. The earliest customer-visible symptom was a delay in US email notifications, which began at around 08:40 UTC.

During the period of customer-facing degradation, affected customers experienced the Inbox loading slowly or failing to load, conversation updates and assignments lagging or not applying, attributes failing to save, Messenger messages appearing not to send or receive, and Fin replies not arriving.

Recovery happened in stages. Engineers, working alongside our database provider, PlanetScale, restored the failing infrastructure. The failing infrastructure consisted of two database shards. The first of those recovered at 11:51 UTC, at which point roughly half of the affected customers began to recover. The second completed at 12:46 UTC, which ended the broad degradation and allowed us to resolve our public status page at 13:23 UTC.

No customer data was lost. The email notifications that were delayed had been queued for later delivery, not discarded. After the database was stable, we restarted processing for the queued email-notification jobs and article-statistics jobs, which record engagement metrics. Email notifications were delivered and article statistics were updated in full.

EU and AU regions were not affected at any point during this incident.

We understand that Fin is core to how you support your own customers, and that any disruption to the Inbox, the Messenger, or Fin directly affects your ability to deliver fast, reliable service. We sincerely apologize for the disruption this caused.

Root cause

Fin uses PlanetScale (built on Vitess) as our high-scale database layer. To handle very high traffic, the majority of our core data is sharded, meaning it is divided across many independent database clusters so that load is spread out and no single cluster has to serve everyone. Each shard holds the data for a portion of our customers. The database in question contains 128 shards, which means approximately 0.78% of our customers reside on each shard.
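As a rough check on the figures above, the per-shard share and the share covered by the two failed shards can be worked out directly. This is a minimal sketch that assumes customers are spread evenly across shards; the function and constant names are illustrative only.

```python
TOTAL_SHARDS = 128     # shard count stated in this write-up
AFFECTED_SHARDS = 2    # shards that failed in this incident


def share_of_customers(shards: int, total_shards: int = TOTAL_SHARDS) -> float:
    """Percentage of customers living on `shards` shards.

    Assumes an even spread of customers across shards, which is an
    approximation rather than a measured distribution.
    """
    return 100.0 * shards / total_shards


per_shard = share_of_customers(1)
affected = share_of_customers(AFFECTED_SHARDS)

print(f"{per_shard:.2f}% of customers per shard")
print(f"{affected:.2f}% of customers across {AFFECTED_SHARDS} shards")
```

Under that assumption, two shards account for roughly 1.56% of customers, which is consistent with the "less than 2%" figure in the summary.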

Each shard is served by a primary and three replicas balanced across three availability zones. A group of nodes degraded in one availability zone. This caused automatic failovers that left the shards with unbalanced replica sets and degraded read capacity. This was not immediately identified and was not automatically resolved. Manual action was required from the PlanetScale team to get the shards back to a healthy state.
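The unbalanced state described above (an availability zone left with no replica able to serve reads) can be expressed as a simple check over a shard's nodes. The sketch below is illustrative only: the `Node` type, its fields, and the zone names are hypothetical, and it does not use or represent any PlanetScale or Vitess API.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical view of one database node in a shard."""
    name: str
    zone: str      # availability zone the node runs in
    role: str      # "primary" or "replica"
    healthy: bool  # whether the node can currently serve traffic


def zones_without_read_replica(nodes, zones):
    """Return the zones that have no healthy replica able to serve reads."""
    serving = Counter(
        n.zone for n in nodes if n.role == "replica" and n.healthy
    )
    return sorted(z for z in zones if serving[z] == 0)


ZONES = ["az-a", "az-b", "az-c"]  # hypothetical zone names

# Hypothetical shard after a failover: the replica in az-c has failed,
# and a second replica has ended up in az-b.
shard_after_failover = [
    Node("n1", "az-b", "primary", True),
    Node("n2", "az-a", "replica", True),
    Node("n3", "az-b", "replica", True),
    Node("n4", "az-c", "replica", False),
]

unbalanced = zones_without_read_replica(shard_after_failover, ZONES)
print(unbalanced)
```

A shard for which this list is non-empty is still serving traffic but has lost read capacity in at least one zone, which is the condition that went undetected here.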

PlanetScale has provided its incident analysis, and we are actively working together to understand how we could have automatically identified that these shards were unhealthy.

Timeline (UTC)

08:40 - A group of nodes in a single AWS availability zone begins degrading, affecting shards in the US database
08:46 - Our email notification dead-letter queue (DLQ) breaches its alarm threshold. An engineer is paged and an incident is declared
09:04 - After initial triage, the incident is escalated to the STRIKE team, who join the investigation. Focus is on the email notification DLQ
10:10 - A second DLQ alarm fires for article statistics, pointing to a fault affecting multiple independent job pipelines sharing the same database infrastructure
10:24 - Incident commander is paged, escalating to a database on-call engineer for support
10:30 - Confirmation that primary nodes on several shards showed near-zero activity between ~08:30 and ~09:30, indicating the nodes had failed rather than merely slowed
10:46 - Escalation ticket is opened with PlanetScale
11:03 - Status page posted for the US region
11:14 - PlanetScale confirms the failed nodes and identifies why the latency was lingering: the earlier automatic failovers had left the affected clusters unbalanced, with an availability zone holding no replica able to serve reads
11:28 - Mitigation begins: the failed nodes are replaced and the affected shard infrastructure is rebuilt
11:51 - The first impacted shard finishes restoring and customers affected by that shard failure recover
12:46 - The second impacted shard finishes restoring and customers affected by that shard failure recover
12:53 - Proactive replacement of the remaining at-risk nodes on two further shards begins, ahead of scheduled AWS maintenance
13:23 - Database latency is normal across all shards. The public status page is closed
16:09 - A follow-up redrive of the DLQs completes, processing the email notifications and article view statistics that were delayed during the incident
16:16 - The incident is resolved
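
The elapsed times implied by this timeline can be derived from the timestamps above. This is a small sketch for reference; the event labels and the `minutes_between` helper are illustrative, and all times are the UTC values listed in the timeline.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


ONSET = "08:40"  # nodes begin degrading

# Milestones copied from the timeline above (UTC).
MILESTONES = {
    "first alarm and page": "08:46",
    "PlanetScale ticket opened": "10:46",
    "status page posted": "11:03",
    "first shard restored": "11:51",
    "second shard restored": "12:46",
    "status page closed": "13:23",
}

elapsed = {label: minutes_between(ONSET, at) for label, at in MILESTONES.items()}

for label, minutes in elapsed.items():
    print(f"{label}: {minutes // 60}h {minutes % 60:02d}m after onset")
```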

Completed remediation

All affected database shards have been restored to full health, with failed nodes replaced and replicas re-established across availability zones.
We proactively replaced the remaining at-risk nodes on two further shards ahead of scheduled AWS maintenance, rather than waiting for that maintenance to potentially disturb them. This removes the most likely path to a near-term recurrence.
The delayed email notifications and article statistics updates were re-driven from their holding queues once the database was stable and delivered in full. Customers did not need to manually resend email notifications, Messenger messages, or Fin replies.
PlanetScale has provided a root cause analysis (RCA) for the incident. We are reviewing it and aligning on next steps.

Ongoing improvements

We are adding our own independent monitoring for AWS health events affecting our database infrastructure so that we can act on node degradation and escalate more effectively.
PlanetScale has committed to evaluating automation that moves a database primary off a node as soon as AWS flags that node as degraded or scheduled for maintenance, before the node can fail completely. This is intended to reduce the window in which a degraded primary can cause write failures before a replica takes over.
PlanetScale has an open feature request for automation that detects and corrects an unbalanced cluster after a failover, ensuring that no availability zone is left without a replica able to serve reads following an automated primary promotion.

We take full responsibility for this incident. We understand the primary failure mode that caused the customer-visible impact. The affected infrastructure has been fully restored, and the highest-risk nodes have been proactively replaced. We are still investigating why the affected shards were not automatically identified as unhealthy and healed without manual intervention. PlanetScale is evaluating the automation improvements outlined above, and we are adding independent monitoring on our side.