This page is a summary of past incidents.
Current information on ongoing incidents can be found at https://status.switch.ch
Date, Time, Duration | 16.12.2024, 10:00, 30 minutes |
---|---|
Severity | level 1 - service disruption |
Incident Summary | Some edu-ID users (but not all) did not receive SMS messages for around 30 minutes (10:00-10:30). The delivery failures occurred in bursts. |
Affected users | This affected in particular the 2-step login for users without TOTP. |
Root cause analysis | The SMS provider reported that it had "experienced unexpected network issue", which caused some SMS messages not to be sent. |
Resolution and recovery | The problems occurred in bursts and not for all users. Therefore, by the time a few users had made the edu-ID team aware of the issue, it was already resolved. In addition, a monitoring problem prevented the issue from being reported earlier. |
Preventive measures, future actions and other learnings | |
Date, Time, Duration | 19.9.2024, 10:03, 30 minutes |
---|---|
Severity | level 2 - service disruption |
Incident Summary | edu-ID users did not receive SMS messages for around 30 minutes (10:00-10:30). This affected in particular the 2-step login for users without TOTP. |
Affected users | In total, there were 1793 unsent SMS messages (present in the logs without a delivery report) between 10:00 and 10:36 on 19.09.2024. In the same time range of the previous day, there were only 23 undelivered SMS messages. The unsent requests are associated with 621 different mobile numbers. Thus, we can conclude that about 620 users did not get their requested SMS for mobile verification. |
Root cause analysis | The cause of the problem was congestion in the delivery queue of our primary SMS provider. |
Resolution and recovery | We switched over to our alternative SMS provider after half an hour, so that SMS messages could be sent again. The primary SMS provider solved the problem later. We switched back to the primary provider before noon. |
Preventive measures, future actions and other learnings | |
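
The figures for the incident on 19.9.2024 come from counting SMS log entries without a delivery report inside the incident window and then counting the distinct mobile numbers behind them. A minimal Python sketch of that counting is shown here. The record layout, the sample data and the function name `summarize_unsent` are assumptions made for illustration and do not reflect the actual edu-ID log format or tooling.

```python
from datetime import datetime

# Hypothetical log records: (timestamp, mobile number, delivery report received?)
records = [
    (datetime(2024, 9, 19, 9, 58), "+41790000001", True),
    (datetime(2024, 9, 19, 10, 2), "+41790000001", False),
    (datetime(2024, 9, 19, 10, 5), "+41790000001", False),
    (datetime(2024, 9, 19, 10, 7), "+41790000002", False),
    (datetime(2024, 9, 19, 10, 20), "+41790000003", True),
    (datetime(2024, 9, 19, 10, 40), "+41790000004", False),
]


def summarize_unsent(records, start, end):
    """Return (number of SMS without delivery report, number of distinct
    mobile numbers behind them) for timestamps in the window [start, end]."""
    unsent = [r for r in records if start <= r[0] <= end and not r[2]]
    distinct_numbers = {r[1] for r in unsent}
    return len(unsent), len(distinct_numbers)


window_start = datetime(2024, 9, 19, 10, 0)
window_end = datetime(2024, 9, 19, 10, 36)
unsent_count, affected_numbers = summarize_unsent(records, window_start, window_end)
print(unsent_count, affected_numbers)
```

Running the same function over the same time range of the previous day would give a baseline to compare against, as was done in the analysis above.
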
Date, Time, Duration | 18.9.2024, 2:00, 7.5 hours |
---|---|
Severity | level 2 - service disruption |
Incident Summary | No SMS messages were sent to edu-ID users via an internal API. This affected in particular the 2-step login for users without TOTP. Also, no e-mails were sent via the same API, e.g. to reset passwords. TOTP authentication was not affected. |
Affected users | In total, requests from 2687 different users failed. Thus, we can conclude that almost 2700 users saw at least one error during MFA login with SMS or mobile/email verification. |
Root cause analysis | The problem was that the communication between two internal APIs failed due to an expired X.509 certificate whose automatic renewal failed. |
Resolution and recovery | The acme-cert-renewal services for the client certificates were manually restarted on all nodes of the internal API. |
Preventive measures, future actions and other learnings | |
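
An expired certificate like the one behind the incident on 18.9.2024 can in principle be noticed ahead of time by regularly comparing its `notAfter` date with the current time. The following Python sketch shows one way such a check could look, using only `ssl.cert_time_to_seconds` from the standard library. The dates, the 14-day warning threshold and the function names are hypothetical and are not taken from the edu-ID setup.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and a certificate's notAfter value.

    `not_after` uses the textual form found in certificates,
    e.g. "Sep 18 00:00:00 2024 GMT". A negative result means expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


def needs_attention(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True if the certificate expires in fewer than `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days


# Hypothetical values for illustration only.
now = datetime(2024, 9, 10, tzinfo=timezone.utc)
print(needs_attention("Sep 18 00:00:00 2024 GMT", now))
print(needs_attention("Dec 18 00:00:00 2024 GMT", now))
```

A check of this kind is independent of the renewal service itself, so it could still raise a warning when automatic renewal fails silently.
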
Date, Time, Duration | 16.9.2024, 7:50, 2 hours |
---|---|
Severity | level 2 - service disruption |
Incident Summary | On Monday morning at around 8:00, at the start of the fall semester for all Swiss universities, it was noticed that the edu-ID login failed or was delayed for some users, while for others it went through smoothly. |
Affected users | Because not all users were affected and many users eventually managed to log in after a few attempts, it is difficult to estimate the number of affected users. However, about 200 additional support tickets, several phone calls and direct emails were received by the edu-ID team and the Switch front desk. It is estimated that several thousand users were affected. |
Root cause analysis | Several factors played a role in this issue. Two of them were the large number of user logins due to the semester start (about 5x higher than in past weeks) and the increased usage of MFA. However, the decisive cause was a missing index on a database table, which in combination with the above consumed a lot of CPU. Even though load tests were performed on the internal MFA API before it was enabled in spring 2024, and even though the MFA API had been used for months without problems, this problem remained hidden until a massive number of logins by many different users triggered it. |
Resolution and recovery | The creation of a database index immediately solved the issue. |
Preventive measures, future actions and other learnings | |
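
To illustrate why a missing index can stay unnoticed at low load and then dominate CPU usage, the Python sketch below compares the query plan for a lookup before and after an index is created. It uses an in-memory SQLite database purely for demonstration. The database product, the table `mfa_tokens`, the column `user_id` and the index name are all assumptions and not the actual edu-ID schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the one that lacked an index.
conn.execute(
    "CREATE TABLE mfa_tokens (id INTEGER PRIMARY KEY, user_id TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO mfa_tokens (user_id, created_at) VALUES (?, ?)",
    [(f"user{i}", "2024-09-16") for i in range(1000)],
)


def query_plan(conn):
    """Return SQLite's plan description for a lookup by user_id."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM mfa_tokens WHERE user_id = ?",
        ("user42",),
    ).fetchall()
    # The fourth column of each row holds the human-readable plan detail.
    return " ".join(row[3] for row in rows)


plan_before = query_plan(conn)  # full table scan: every row is examined
conn.execute("CREATE INDEX idx_mfa_tokens_user_id ON mfa_tokens (user_id)")
plan_after = query_plan(conn)   # index search: only matching rows are touched

print(plan_before)
print(plan_after)
```

The exact wording of the plan depends on the SQLite version, but the first plan reports a scan of the whole table and the second a search using the new index. With few distinct users a scan is cheap, which is consistent with the problem staying hidden until many different users logged in at once.
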