Microsoft has delivered its postmortem report detailing the failures that led to unlucky folks being unable to log into its cloud services for 14 hours last week.
Redmond said on Monday this week that there were three separate cock-ups that combined to cause the cascading mess that left Azure and Office 365 users unable to sign in via multi-factor authentication for much of Monday, November 19.
"There were three independent root causes discovered," the Microsofties explained. "In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time."
All three glitches occurred within a single system: Azure Active Directory Multi-Factor Authentication. Microsoft uses that service to handle multi-factor login for the Azure, Office 365, and Dynamics services.
The first problem, Microsoft said, was undesirably high latency between the MFA frontend and its cache, caused by a high number of users attempting to log in that Monday morning. Latency is pretty important because MFA login codes are short-lived, typically 30 or 60 seconds, so if codes expire before they can be used, people will attempt to sign in again, adding more strain to the system.
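To see why expiring codes make an overload worse rather than merely slower, here is a toy queue model in Python. It is a sketch of the general feedback loop, not of Microsoft's systems: the function name, the tick-based clock, and every parameter are assumptions made up for illustration. Each tick, new sign-in attempts join a queue and the server handles a fixed number of them; any request that waited as long as the code's lifetime counts as wasted work and goes back in the queue as a retry.

```python
from collections import deque


def simulate_logins(arrivals_per_tick, capacity_per_tick, code_ttl_ticks, ticks):
    """Toy model (hypothetical): requests that wait too long are retried."""
    queue = deque()  # each entry is the tick at which the request joined the queue
    successes = 0
    wasted = 0
    for now in range(ticks):
        # New sign-in attempts arrive at the start of the tick.
        queue.extend([now] * arrivals_per_tick)
        # The server handles at most capacity_per_tick requests, oldest first.
        for _ in range(min(capacity_per_tick, len(queue))):
            enqueued_at = queue.popleft()
            if now - enqueued_at >= code_ttl_ticks:
                # The code expired while waiting: the work is wasted and
                # the user tries again, rejoining the back of the queue.
                wasted += 1
                queue.append(now)
            else:
                successes += 1
    return {"successes": successes, "wasted": wasted, "backlog": len(queue)}


healthy = simulate_logins(arrivals_per_tick=10, capacity_per_tick=12,
                          code_ttl_ticks=3, ticks=50)
overloaded = simulate_logins(arrivals_per_tick=10, capacity_per_tick=8,
                             code_ttl_ticks=3, ticks=50)
print(healthy)
print(overloaded)
```

With spare capacity, every request is served before its code expires. Once arrivals outpace capacity, waits creep past the code lifetime and the server starts spending its limited capacity on requests that are already doomed, which only lengthens the queue.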
From there, a race condition arose between the frontend and backend servers that handle MFA. Finally, an accumulation of the first two problems exposed a third bug in the way the backend servers handled the backlog of data requests.
On the one hand, it's nice that Redmond is being transparent and upfront. On the other hand, paying subscribers unable to log in for 14 hours may feel this is the very least the Windows giant could do. Here's Microsoft's explanation in full in case it disappears from the website:
There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.
The first two root causes were identified as issues on the MFA frontend server, both introduced in a roll-out of a code update that began in some datacenters (DCs) on Tuesday, 13 November 2018 and completed in all DCs by Friday, 16 November 2018. The issues were later determined to be activated once a certain traffic threshold was exceeded which occurred for the first time early Monday (UTC) in the Azure West Europe (EU) DCs. Morning peak traffic characteristics in the West EU DCs were the first to cross the threshold that triggered the bug. The third root cause was not introduced in this rollout and was found as part of the investigation into this event.
1. The first root cause manifested as latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger second root cause.
2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend.
3. The third identified root cause, was previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.
As a result, Microsoft's multi-factor servers were falling over while at the same time its administrators were being told that everything was fine. The series of screw-ups first hit EMEA and APAC customers, then as the day progressed, US subscribers. Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations.
Because the services had presented themselves as healthy, actually identifying and mitigating the trio of bugs took some time.
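A server that looks healthy while doing no useful work is a classic symptom of a health check that only asks "is the process running?" The Python sketch below contrasts that shallow check with one that also looks at backlog and recent progress. It is purely illustrative: Microsoft has not published how its monitoring works, and the BackendStats fields, function names, and thresholds here are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BackendStats:
    """Hypothetical snapshot of one backend server's state."""
    process_alive: bool
    pending_requests: int
    seconds_since_last_completion: float


def shallow_health(stats: BackendStats) -> bool:
    # Only asks whether the process is up; says nothing about progress.
    return stats.process_alive


def deep_health(stats: BackendStats,
                max_pending: int = 1000,
                max_stall_seconds: float = 30.0) -> bool:
    # Also requires a bounded backlog and a recently completed request.
    # Both thresholds are arbitrary illustrative defaults.
    return (stats.process_alive
            and stats.pending_requests <= max_pending
            and stats.seconds_since_last_completion <= max_stall_seconds)


# A server that is running but has stopped completing requests.
stuck = BackendStats(process_alive=True,
                     pending_requests=50_000,
                     seconds_since_last_completion=600.0)
print(shallow_health(stuck), deep_health(stuck))
```

The stuck server passes the shallow check and fails the deep one, which is the gap between "appearing healthy" and actually serving requests.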
"The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues," Microsoft explained.
"This was made more acute by the gaps in telemetry that would identify the backend server issue."
Now, Microsoft says, it is looking to prevent a recurrence of the fiasco by reviewing how it handles updates and testing, as well as reviewing its internal monitoring services and how it contains failures once they begin. ®