Delayed import/export in Western Europe region
Incident Report for Onshape
Postmortem

Overview

On June 2, 2021, Onshape suffered a region-specific partial outage in the eu-west-1 region in which translations (import and export) were substantially delayed. The incident began at 05:34 UTC, when an automated service check for eu-west-1 failed because of translation delays. Paradoxically, increasing the number of translator instances in eu-west-1 did not burn down the queue, even though our overall translation rate increased with the additional instances.

The translations themselves succeeded, but inserting the translated files into documents was delayed and sometimes failed.

At 07:16 UTC a customer in the EU reported to support that the same translated file was being inserted into their document over and over again.

We eventually scaled to many translator instances in eu-west-1 but could not drive the queue length down. By 09:13 UTC it was clear that the same translations were being processed repeatedly in a loop.

By 15:15 UTC we had realized that the translation events were being requeued within our messaging service itself, and that the additional translator instances were doing the requeueing. By then we had also developed a hotfix, had it reviewed, built it, and installed it on several translator nodes in eu-west-1 as a probable mitigation.

At 15:39 UTC the all-clear was declared.

Assessment and Actions

A software component update deployed as part of the 1.131 release on June 1 changed the behavior of our messaging service. The changed behavior appeared only at production scale.

A hotfix that disabled the new behavior was deployed to a few translators in eu-west-1, and the remaining translators were terminated. The translation queue then burned down quickly.

Conclusions

The 1.131 deploy brought in a new version of an internal software component that behaves differently under load. This change caused translation messages to be requeued multiple times. The more translator instances we started, the worse the requeueing became.
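
To illustrate why adding instances raised the translation rate without shrinking the queue, here is a minimal toy model in Python. It is a sketch under stated assumptions, not a description of the actual messaging service: the function name simulate, the requeue_on_complete flag, and the rule that a faulty worker puts every completed message back on the queue are all hypothetical simplifications.

```python
from collections import deque


def simulate(workers, ticks, backlog, requeue_on_complete):
    """Toy model of a work queue (hypothetical, not the real service).

    Each tick, every worker takes one message from the queue and completes it.
    If requeue_on_complete is True, the completed message is put back on the
    queue, standing in for the faulty requeueing behavior.
    """
    queue = deque(range(backlog))
    seen = set()
    completed = 0
    duplicates = 0
    for _ in range(ticks):
        for _ in range(workers):
            if not queue:
                break
            msg = queue.popleft()
            completed += 1
            if msg in seen:
                duplicates += 1  # the same file would be inserted again
            seen.add(msg)
            if requeue_on_complete:
                queue.append(msg)  # assumed fault: message returns to the queue
    return {
        "completed": completed,
        "duplicates": duplicates,
        "remaining": len(queue),
    }


healthy = simulate(workers=5, ticks=50, backlog=100, requeue_on_complete=False)
faulty_small = simulate(workers=5, ticks=50, backlog=100, requeue_on_complete=True)
faulty_large = simulate(workers=20, ticks=50, backlog=100, requeue_on_complete=True)
print(healthy)
print(faulty_small)
print(faulty_large)
```

In this model the healthy run drains the backlog, while both faulty runs end with the queue at its original length. Moving from 5 to 20 workers raises the number of completed translations but raises the number of duplicate completions as well, which mirrors the pattern described above.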

We are truly sorry for the loss of productivity. We take the availability of the Onshape service seriously and we didn’t meet your expectations or ours. We will learn from this and use it to improve the service we provide to you.

Posted Jun 10, 2021 - 22:14 EDT

Resolved
The incident has been resolved. Import/export times have returned to normal in the Western Europe region.
Posted Jun 02, 2021 - 11:42 EDT
Update
We are receiving reports of some users getting multiple tabs created from a single import. We are continuing to work on this issue.
Posted Jun 02, 2021 - 10:19 EDT
Identified
We are working to process a backlog of import/export requests in the Western Europe region.
Posted Jun 02, 2021 - 04:21 EDT
This incident affected: Onshape CAD Service (https://cad.onshape.com) (Western Europe).