Onshape is unavailable

Incident Report for Onshape

Postmortem

Overview

On June 1, 2021, Onshape suffered a service-wide outage starting at 14:49 UTC. The service was intermittently available for some users, but it did not become fully stable until sometime before 18:36 UTC, when the “all clear” was called.

The first automated notification was made to the Onshape Operations team at 14:49 UTC. The Operations team immediately began our Incident Response (IR) process.

The service presented customers with failures to sign in, failures to open models and frequent support code errors for users in all regions. The Onshape status page at https://status.onshape.com was updated at 14:53 UTC. Information about this incident and all historical incidents can be found on that site.

IR team members discovered that a very large number of instances of a few specific server types were running. One of our internal messaging systems and one of our database clusters were experiencing very high CPU load.

Assessment and Actions

We began to suspect that a runaway auto-scaling service was the cause of at least some of the issues. Disabling the auto-scaling service for specific server types helped achieve some stability. Terminating the excess server instances restored the service to normal operation.

Conclusions

The cause of the spike in instance creation (and the overloading of our messaging and database services) was a recent auto-scaling service change which was designed to accelerate the launching of some instances.

The auto-scaling service running this change started too many instances and overwhelmed several other parts of the service.

Technical Improvements to be Made

Increase the resiliency of our messaging and database services against too many internal client sessions.
Add governors to our auto-scaling services and our deployment service to prevent runaway instance requests.
Some customers still don’t know about https://status.onshape.com. We should consider additional places in the UI to point users at the status page when we are experiencing failures.
We need to load test all of our services with too many database clients (servers), not just too many requests.
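
As an illustration of the "governor" idea in the list above, the following is a minimal sketch and not Onshape's implementation. It assumes a governor that sits between the auto-scaler and the instance-launch call and enforces two limits per server type: a hard ceiling on total instances and a cap on launches within a sliding time window. The class name `LaunchGovernor` and all of its parameters are hypothetical.

```python
import time
from collections import deque


class LaunchGovernor:
    """Hypothetical guard that caps fleet size and launch rate for one server type."""

    def __init__(self, max_instances, max_launches_per_window,
                 window_seconds, clock=time.monotonic):
        self.max_instances = max_instances
        self.max_launches_per_window = max_launches_per_window
        self.window_seconds = window_seconds
        self._clock = clock            # injectable so the limits can be tested
        self._launch_times = deque()   # timestamps of recently granted launches

    def allowed(self, running, requested):
        """Return how many of the `requested` launches may proceed right now."""
        now = self._clock()
        # Forget launches that have aged out of the sliding window.
        while self._launch_times and now - self._launch_times[0] >= self.window_seconds:
            self._launch_times.popleft()
        # Never exceed the hard ceiling on total instances.
        headroom = max(0, self.max_instances - running)
        # Never exceed the number of launches permitted per window.
        rate_budget = max(0, self.max_launches_per_window - len(self._launch_times))
        granted = max(0, min(requested, headroom, rate_budget))
        self._launch_times.extend([now] * granted)
        return granted
```

In this sketch the auto-scaler would call `allowed(running, requested)` before every launch request and start only the number of instances returned, so a faulty scaling change can grow the fleet no faster, and no larger, than the configured limits.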

Process Improvements to be Made

Overall, the IR process worked well. There were still customers who were unaware of the status page. We should add Twitter notifications for large outages. We need to decide whether the criteria for tweeting are the same as those for updating the status page or whether there is a higher bar. The primary goal of the tweet should be to point users at the status page.

We are truly sorry for the loss of productivity. We take the availability of the Onshape service seriously and we didn’t meet your expectations or ours. We will learn from this and use it to improve the service we provide to you.

Posted Jun 10, 2021 - 22:12 EDT

Resolved

The incident has been resolved. A full Root Cause Analysis will be posted here and on the Onshape forum as soon as available.

Posted Jun 01, 2021 - 15:46 EDT

Monitoring

We have stabilized the service. Imports and exports may still be delayed.

Posted Jun 01, 2021 - 14:47 EDT

Update

We are implementing a fix to stabilize the service now. We truly appreciate your patience as we work through the issues.

Posted Jun 01, 2021 - 14:15 EDT

Identified

We have seen an additional regression in one of our internal coordination services. We have all team members engaged in restoring service health.

Posted Jun 01, 2021 - 13:19 EDT

Monitoring

A fix has been implemented and we are monitoring the results. We may still see delayed import/export operations in the North America region as queues are processed. We will be providing additional reports here as the service continues to recover.

Posted Jun 01, 2021 - 12:46 EDT

Update

An internal coordination service is overloaded. We are working to resolve the issue now. Update in 30 minutes.

Posted Jun 01, 2021 - 12:23 EDT

Update

We are continuing to investigate the issue. We will provide an additional update here within 30 mins.

Posted Jun 01, 2021 - 11:48 EDT

Investigating

Our automated systems are reporting an increased error rate connecting to Onshape from all regions.

Posted Jun 01, 2021 - 10:53 EDT

This incident affected: Onshape CAD Service (https://cad.onshape.com) (North America, Western Europe, Southeast Asia, Australia, Northeast Asia).