On June 1, 2021, Onshape suffered a service-wide outage starting at 14:49 UTC. The service was intermittently available for some users, but it was not fully stable until sometime before 18:36 UTC, when the “all clear” was called.
The first automated notification was made to the Onshape Operations team at 14:49 UTC. The Operations team immediately began our Incident Response (IR) process.
The service presented customers with failures to sign in, failures to open models and frequent support code errors for users in all regions. The Onshape status page at https://status.onshape.com was updated at 14:53 UTC. Information about this incident and all historical incidents can be found on that site.
IR team members discovered that a very large number of instances of a few specific server types were running. One of our internal messaging systems and one of our database clusters were experiencing very high CPU load.
We began to suspect that a runaway auto-scaling service was the cause of at least some of the issues. Disabling the auto-scaling service for the specific server types helped achieve some stability. Terminating the excess server instances restored the service to normal operation.
The cause of the spike in instance creation (and the overloading of our messaging and database services) was a recent auto-scaling service change which was designed to accelerate the launching of some instances.
The auto-scaling service running this change started too many instances and overwhelmed several other parts of the service.
Increase the resiliency of our messaging and database services against too many internal client sessions.
Add governors to our auto-scaling services and our deployment service to prevent runaway instance requests.
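As an illustration of what such a governor could look like, here is a minimal Python sketch. It is not Onshape's actual implementation: the class name `LaunchGovernor`, its parameters, and the idea of combining a fleet-size ceiling with a sliding-window launch rate are all assumptions made for the example.

```python
import time
from collections import deque


class LaunchGovernor:
    """Hypothetical governor: caps fleet size and launch rate for one server type."""

    def __init__(self, max_instances, max_launches_per_window, window_seconds,
                 clock=time.monotonic):
        self.max_instances = max_instances
        self.max_launches_per_window = max_launches_per_window
        self.window_seconds = window_seconds
        self._clock = clock          # injectable so the governor can be tested
        self._recent = deque()       # timestamps of recently approved launches

    def approve(self, running, requested):
        """Return how many of the `requested` launches may proceed right now."""
        now = self._clock()
        # Forget approvals that have aged out of the sliding window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        headroom = max(0, self.max_instances - running)
        rate_budget = max(0, self.max_launches_per_window - len(self._recent))
        granted = max(0, min(requested, headroom, rate_budget))
        # Record one timestamp per approved launch.
        self._recent.extend([now] * granted)
        return granted
```

An auto-scaler using a check like this would call `approve(running, requested)` before every launch request and start only the number of instances returned, so that a faulty scaling decision is bounded by both the ceiling and the rate limit.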
Some customers still don’t know about https://status.onshape.com. We should consider additional places in the UI to point users at the status page when we are experiencing failures.
We need to load test all of our services with too many database clients (servers), not just too many requests.
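To make the distinction between "too many requests" and "too many clients" concrete, the following Python sketch simulates a burst of clients that all hold a session at the same time. It is only a toy model under stated assumptions: `SessionLimitedService` is a hypothetical stand-in for a database or messaging service with a session cap, and `run_client_storm` is an invented helper, not a real load-testing tool.

```python
import threading


class SessionLimitedService:
    """Hypothetical stand-in for a service that caps concurrent client sessions."""

    def __init__(self, max_sessions):
        self._slots = threading.BoundedSemaphore(max_sessions)

    def connect(self):
        # Reject immediately instead of queueing, so overload stays visible.
        return self._slots.acquire(blocking=False)

    def disconnect(self):
        self._slots.release()


def run_client_storm(service, client_count):
    """Open `client_count` sessions at once; return (accepted, rejected)."""
    results = []
    lock = threading.Lock()
    all_attempted = threading.Barrier(client_count)

    def client():
        accepted = service.connect()
        with lock:
            results.append(accepted)
        # Hold the session until every client has tried to connect.
        all_attempted.wait()
        if accepted:
            service.disconnect()

    threads = [threading.Thread(target=client) for _ in range(client_count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    accepted = sum(results)
    return accepted, client_count - accepted
```

For example, `run_client_storm(SessionLimitedService(50), 200)` models 200 servers connecting to a service that accepts 50 sessions. A real test would replace the stand-in with actual connections and measure CPU load and latency as the client count grows.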
Overall, the IR worked well. There were still customers who were unaware of the status page. We should add Twitter notifications for large outages. We need to decide whether the criteria for tweeting are the same as those for updating the status page or whether there is a higher bar. The primary goal of the tweet should be to point users at the status page.
We are truly sorry for the loss of productivity. We take the availability of the Onshape service seriously and we didn’t meet your expectations or ours. We will learn from this and use it to improve the service we provide to you.