Switch edu-ID Incident Reports

This page is a summary of past incidents.

Current information on ongoing incidents can be found at https://status.switch.ch

19.9.2024 SMS Provider Outage

Date, Time, Duration: 19.9.2024, 10:03, 30 minutes
Severity level: 2 - service disruption
Incident Summary

edu-ID users did not receive SMS messages for around 30 minutes (10:00-10:30). This particularly affected the 2-step login for users without TOTP.

Affected users

In total, 1793 SMS messages went unsent (logged without a delivery report) between 10:00 and 10:36 on 19.09.2024. In the same time range on the previous day, only 23 SMS messages were undelivered.

These requests are associated with 621 different mobile numbers. We therefore conclude that about 620 users did not receive the SMS they requested for mobile verification.
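The two figures above differ because one mobile number can account for several requests. A minimal sketch of how such a count could be derived from log records follows. The `SmsLogEntry` record and its fields are hypothetical, since the actual log format is not described here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SmsLogEntry:
    """Hypothetical log record; the real log format is an assumption."""
    sent_at: datetime
    mobile_number: str
    delivery_report_at: Optional[datetime]  # None = no delivery report logged


def summarize_undelivered(
    entries: Iterable[SmsLogEntry], start: datetime, end: datetime
) -> Tuple[int, int]:
    """Return (undelivered message count, distinct affected numbers)."""
    undelivered = [
        e for e in entries
        if start <= e.sent_at <= end and e.delivery_report_at is None
    ]
    return len(undelivered), len({e.mobile_number for e in undelivered})
```

Running the same function over the same time range on a normal day gives a baseline for comparison.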

Root cause analysis

The cause of the problem was congestion in the delivery queue of our primary SMS provider.

Resolution and recovery

We switched over to our alternative SMS provider after half an hour, so that SMS messages could be sent again. The primary SMS provider solved the problem later. We switched back to the primary provider before noon.

Preventive measures, future actions and other learnings
  • Switching to the alternative SMS provider took too long. Streamline this process.
  • Alert was not sent to all edu-ID staff. Extend alerting channels.
  • The primary SMS provider was not aware of the issue. Switch needs a direct contact.
  • Collect separate SMS metrics for each provider.
  • Automate switchover to alternative provider in case of problems.
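
One possible way to automate the switchover is to watch how many recent messages received a delivery report and to fail over when that share drops too low. The sketch below only illustrates this idea. The class name, the window size and the threshold are assumptions, not the implemented solution.

```python
from collections import deque
from typing import Deque, Sequence


class ProviderSwitch:
    """Track recent delivery outcomes and fail over when too many are missing.

    Hypothetical sketch: providers are represented by name only.
    """

    def __init__(self, providers: Sequence[str], window: int = 50,
                 min_delivery_ratio: float = 0.8) -> None:
        if len(providers) < 2:
            raise ValueError("need at least two providers")
        self._providers = list(providers)
        self._active = 0
        self._min_ratio = min_delivery_ratio
        self._outcomes: Deque[bool] = deque(maxlen=window)

    @property
    def active(self) -> str:
        return self._providers[self._active]

    def record(self, delivered: bool) -> str:
        """Record whether a delivery report arrived; return the active provider."""
        self._outcomes.append(delivered)
        window_full = len(self._outcomes) == self._outcomes.maxlen
        ratio = sum(self._outcomes) / len(self._outcomes)
        if window_full and ratio < self._min_ratio:
            # Fail over and start a fresh window for the new provider.
            self._active = (self._active + 1) % len(self._providers)
            self._outcomes.clear()
        return self.active
```

Because outcomes are recorded per active provider, the same data would also yield the separate per-provider metrics mentioned above.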

 

18.9.2024 SMS and e-mails not sent via internal API

Date, Time, Duration: 18.9.2024, 2:00, 7.5 hours
Severity level: 2 - service disruption
Incident Summary

No SMS messages were sent to edu-ID users via an internal API. This particularly affected the 2-step login for users without TOTP. In addition, no e-mails were sent via the same API, e.g. to reset passwords. TOTP authentication was not affected.

Affected users

In total, requests from 2687 different users failed. Thus, we can conclude that almost 2700 users saw at least one error during MFA login with SMS or mobile/email verification.

Root cause analysis

Communication between two internal APIs failed because an X.509 certificate had expired after its automatic renewal failed.

Resolution and recovery

The acme-cert-renewal services for the client certificates were restarted manually on all nodes of the internal API.

Preventive measures, future actions and other learnings
  • Verify and clean up the deployment of all APIs.
  • Include certificate expiration/renewal in monitoring.
  • Implement monitoring and alerting for all critical APIs.
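
A certificate expiration check for monitoring could look roughly like the following sketch. It assumes the monitoring job already has the certificate's `notAfter` string in the format returned by Python's `ssl.SSLSocket.getpeercert()`. The function names and the 14-day warning threshold are hypothetical.

```python
import ssl
from datetime import datetime, timedelta, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate expires.

    `not_after` uses the "%b %d %H:%M:%S %Y GMT" format of getpeercert();
    `now` must be timezone-aware.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now) / timedelta(days=1)


def needs_alert(not_after: str, now: datetime, warn_days: float = 14.0) -> bool:
    """True if the certificate expires within `warn_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) < warn_days
```

Alerting well before the expiry date leaves time to react when automatic renewal fails silently.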

 

16.9.2024 edu-ID Login not working for some users

Date, Time, Duration: 16.9.2024, 7:50, 2 hours
Severity level: 2 - service disruption
Incident Summary

On Monday morning at around 8:00, at the start of the fall semester for all Swiss universities, the edu-ID login was found to fail or be delayed for some users, while it went through smoothly for others.

Affected users

Because not all users were affected and many users eventually managed to log in after a few attempts, it is difficult to estimate the number of affected users. However, the edu-ID team and the Switch front desk received about 200 additional support tickets as well as several phone calls and direct e-mails. It is estimated that several thousand users were affected.

Root cause analysis

Several factors played a role in this issue. Two of them were the large number of user logins (about 5x higher than in past weeks) due to the semester start and the increased usage of MFA. However, the decisive cause was a missing index on a database table, which in combination with the above consumed a lot of CPU.

Even though load tests were performed on the internal MFA API before it was enabled in Spring 2024 and even though the MFA API has been used for months without problems, this problem remained hidden until a massive number of logins by many different users triggered it.

Resolution and recovery

The creation of a database index immediately solved the issue.
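
The effect of a missing index can be illustrated with a small, self-contained sketch. It uses SQLite from the Python standard library as a stand-in for PostgreSQL, and the table and column names (`mfa_tokens`, `user_id`) are hypothetical, not the actual schema. Without the index the lookup scans the whole table; with it, the database searches the index instead.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table; the real schema is not described in this report.
conn.execute(
    "CREATE TABLE mfa_tokens (id INTEGER PRIMARY KEY, user_id TEXT, secret TEXT)"
)
conn.executemany(
    "INSERT INTO mfa_tokens (user_id, secret) VALUES (?, ?)",
    ((f"user{i}", "x") for i in range(10_000)),
)

lookup = "SELECT secret FROM mfa_tokens WHERE user_id = ?"
plan_before = query_plan(conn, lookup, ("user42",))  # full table scan
conn.execute("CREATE INDEX idx_mfa_tokens_user_id ON mfa_tokens (user_id)")
plan_after = query_plan(conn, lookup, ("user42",))   # index search

print(plan_before)
print(plan_after)
```

A full scan per login is cheap at low traffic but multiplies with the number of concurrent logins, which is consistent with the problem only surfacing at semester start.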

Preventive measures, future actions and other learnings
  • Improve PostgreSQL SLIs to identify database congestion more quickly
  • Check for other missing indices in the database
  • Improve SLIs and alerting of other critical components where possible
  • Speed up notification of users. Prepare messages that can quickly be published on the IdP and the load balancer
  • The incident was not displayed on https://status.switch.ch/. Add "edu-ID login" to the status page in addition to account management.
  • The internal incident management had to be improvised due to team personnel changes. Improve handover processes. Conduct training for successors. Update incident management kit and checklist.