Switch edu-ID Incident Reports

This page is a summary of past incidents. Level 1 incidents (disruptions) are published here within 7 days, and level 2 incidents (emergencies) within 3 days.

Current information on ongoing incidents can be found at https://status.eduid.ch

2.5.2025 Metadata / CRL Issue

Date, Time, Duration	02.05.2025, 08:15, 90 minutes
Severity	Level 1 - Partial Service Disruption
Incident Summary	Because a signing process on the metadata signing hosts was slow or blocked, an updated CRL could not be published in time. The old CRL was still valid, but the absence of a newer CRL alongside the SAML2 metadata caused problems for older versions of the Shibboleth Identity Provider. These Identity Providers discarded the SAML2 metadata altogether and could therefore no longer log in users, because they had no metadata left for the relying parties.
Affected users	Users of three self-hosted Identity Providers running an old Shibboleth version (3.3) could not log in for up to 90 minutes. The affected Identity Providers have only a few users, and because the incident occurred on the Friday morning after May 1st, most probably only a few users were prevented from logging in. We have not received any tickets. Users logging in via edu-ID or via a more recent version of the Shibboleth Identity Provider were not affected.
Root cause analysis	A bug in the CRL signing script caused an endless loop and a large number of garbage files. This prevented a new CRL from being generated.
Resolution and recovery	After the bug in the CRL signing script was fixed, the garbage files were cleaned up and the script was restarted, correct CRLs were generated and included in the metadata. The affected Shibboleth Identity Providers reloaded these metadata files automatically after some time.
Preventive measures, future actions and other learnings	The script generating the CRLs is being improved. Improve the SAML metadata and CRL monitoring.
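
The CRL monitoring mentioned in this report could, for example, raise an alert while the published CRL is still valid but is getting close to its nextUpdate time, which is the situation described in the incident summary. The following is only a minimal sketch under stated assumptions, not the monitoring actually used: the function name, the 12-hour threshold and the timestamps are hypothetical, and a real check would first read the nextUpdate field from the published CRL.

```python
from datetime import datetime, timedelta, timezone


def crl_status(next_update: datetime, now: datetime,
               warn_before: timedelta = timedelta(hours=12)) -> str:
    """Classify a CRL by how close it is to its nextUpdate time.

    Hypothetical helper: returns "expired", "stale" or "ok".
    """
    if now >= next_update:
        return "expired"  # relying software may reject the CRL
    if next_update - now <= warn_before:
        return "stale"    # still valid, but a newer CRL is overdue
    return "ok"


# Illustrative timestamps only, not taken from the incident.
next_update = datetime(2025, 5, 3, 8, 0, tzinfo=timezone.utc)
print(crl_status(next_update, datetime(2025, 5, 2, 8, 0, tzinfo=timezone.utc)))
print(crl_status(next_update, datetime(2025, 5, 2, 22, 0, tzinfo=timezone.utc)))
```

Alerting on the "stale" state rather than only on "expired" would leave time to react before older software starts rejecting the metadata.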

28.2.2025 IdP Outage

Date, Time, Duration	28.02.2025, 09:12, 10 minutes
Severity	Level 1 - Service Disruption
Incident Summary	All users encountered a timeout message during the login process.
Affected users	Potentially every login during the time frame
Root cause analysis	Deployment of a faulty configuration on the load balancers led to the unavailability of most Identity Provider nodes. The nodes that remained active were not in a proper working state.
Resolution and recovery	The faulty configuration of the load balancers was rolled back as soon as our monitoring system sent out an alert and the mistake was noticed.
Preventive measures, future actions and other learnings	Coordinate between the involved teams before deploying a new configuration.

17.2.2025 Degraded Authentication Performance

Date, Time, Duration	17.2.2025, 10:10, 30 minutes
Severity	level 2 - degraded performance
Incident Summary	Most users encountered a slow login process or saw a timeout message
Affected users	Potentially every login in the time frame
Root cause analysis	The combination of a temporary high load and a rolling upgrade of IdP nodes led to degraded performance. Only half of the IdP nodes were active after 10:00. The remaining nodes were unable to handle the incoming login requests in an acceptable amount of time.
Resolution and recovery	The overloaded nodes were disabled and we switched over to the spare nodes. Since the request peak was already over at that time, the remaining nodes could handle the load. The previously overloaded nodes were enabled again after they had recovered.
Preventive measures, future actions and other learnings	Review our maintenance practice. Consider adding/augmenting resources. Enhance coordination of rolling release planning with the academic calendar.

12.2.2025 Partial MFA Outage

Date, Time, Duration	12.2.2025, 13:45, 20 minutes
Severity	level 1 - service disruption
Incident Summary	Some users were not able to authenticate with MFA.
Affected users	About 40% of the users who needed to log in with a second factor or passkey during the incident were affected and had to retry later.
Root cause analysis	The IdP nodes authenticate to our internal APIs using secure mTLS authentication. The renewal of involved certificates led to authentication failures on two of five IdP nodes, due to a mismatch of machine identities in the API configurations. The configuration mismatch was caused by a coincidence involving the introduction of new IdP nodes and usage of a shared client certificate on all nodes.
Resolution and recovery	Rollback to the previous certificate configuration on the IdP instances, based on the documented rollback strategy
Preventive measures, future actions and other learnings	Improve the management of internal mTLS authentication and introduce centrally managed machine certificates. Improve error messages on the login services (the current messages mislead end users into resetting MFA, which was actually not needed). Further improve monitoring and alerting.

21.1.2025 Account Registration Outage

Date, Time, Duration	21.1.2025, 9:00, 3 hours
Severity	level 1 - service disruption
Incident Summary	edu-ID users could not create a new account.
Affected users	Around 200 users were affected, and they had to retry later
Root cause analysis	The underlying cause was a new version of PHP, which prevented the account management from writing changes back to the user database.
Resolution and recovery	Rolling back to the previous version of PHP.
Preventive measures, future actions and other learnings	Improve alerting: the e2e tests did not cause an alarm during automated testing or after deployment.

16.12.2024 Partial SMS Provider Outage

Date, Time, Duration	16.12.2024, 10:00, 30 minutes
Severity	level 1 - service disruption
Incident Summary	Some edu-ID users (but not all) didn't receive SMS messages for around 30 minutes (10:00-10:30). The delivery failures occurred in bursts.
Affected users	This affected in particular the 2-step login for users without TOTP.
Root cause analysis	The SMS provider reported that it had "experienced unexpected network issue", which caused some SMS not to be sent.
Resolution and recovery	The problems occurred in bursts and not for all users. Therefore, by the time the edu-ID team was made aware of the issue by a few users, the issue was already resolved. Also, due to a monitoring problem the issue was not reported earlier.
Preventive measures, future actions and other learnings	Automate switchover to alternative provider in case of problems. Monitor the monitoring system and ensure it is working properly all the time.

19.9.2024 SMS Provider Outage

Date, Time, Duration	19.9.2024, 10:03, 30 minutes
Severity	level 2 - service disruption
Incident Summary	edu-ID users didn't receive SMS messages for around 30 minutes (10:00-10:30). This affected in particular the 2-step login for users without TOTP.
Affected users	In total, there were 1793 unsent SMSes (present without delivery report in the logs) between 10:00 and 10:36 on 19.09.2024. In the same time range of the previous day, there were only 23 undelivered SMSes. These requests are associated with 621 different mobile numbers. Thus, we can conclude that about 620 users didn't get their requested SMS for mobile verification.
Root cause analysis	The cause of the problem was congestion in the delivery queue of our primary SMS provider.
Resolution and recovery	We were able to switch over to our alternative SMS provider after half an hour, so that SMS messages could be sent again. The primary SMS provider solved the problem later. We switched back to the primary provider before noon.
Preventive measures, future actions and other learnings	Switching to the alternative SMS provider took too long. Streamline this process. Alert was not sent to all edu-ID staff. Extend alerting channels. The primary SMS provider was not aware of the issue. Switch needs a direct contact. Collect separate SMS metrics for each provider. Automate switchover to alternative provider in case of problems.
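
The automated switchover to the alternative provider listed above could be driven by the delivery reports themselves. The sketch below is only an illustration under stated assumptions, not the edu-ID implementation: the class name, the provider labels "primary" and "fallback", the window size and the failure-rate threshold are all hypothetical.

```python
from collections import deque


class SmsFailover:
    """Switch from a primary to a fallback provider on a high failure rate.

    Hypothetical sketch: provider labels, window size and threshold are
    assumptions, not the edu-ID configuration.
    """

    def __init__(self, window: int = 50, max_failure_rate: float = 0.5):
        self.results = deque(maxlen=window)  # True = delivered, False = failed
        self.max_failure_rate = max_failure_rate
        self.active = "primary"

    def record(self, delivered: bool) -> str:
        """Record one delivery report and return the provider to use next."""
        self.results.append(delivered)
        window_full = len(self.results) == self.results.maxlen
        failure_rate = self.results.count(False) / len(self.results)
        if (self.active == "primary" and window_full
                and failure_rate > self.max_failure_rate):
            self.active = "fallback"
            self.results.clear()  # start fresh statistics for the new provider
        return self.active


failover = SmsFailover(window=4, max_failure_rate=0.5)
for delivered in [True, False, False, False]:
    provider = failover.record(delivered)
print(provider)
```

Waiting for a full window before switching avoids reacting to a single failed message. Switching back to the primary provider is deliberately left as a manual step in this sketch.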

18.9.2024 SMS and e-mails not sent via internal API

Date, Time, Duration	18.9.2024, 2:00, 7.5 hours
Severity	level 2 - service disruption
Incident Summary	No SMS were sent to edu-ID users via an internal API. This affected in particular the 2-step login for users without TOTP. Also, no e-mails were sent via the same API, e.g. to reset passwords. TOTP authentication was not affected.
Affected users	In total, requests from 2687 different users failed. Thus, we can conclude that almost 2700 users saw at least one error during MFA login with SMS or mobile/email verification.
Root cause analysis	The problem was that the communication between two internal APIs failed due to an expired X.509 certificate whose automatic renewal failed.
Resolution and recovery	Manually restarting the acme-cert-renewal services for the client certificates on all nodes of the internal API.
Preventive measures, future actions and other learnings	Verify and clean up the deployment of all APIs. Include certificate expiration/renewal in monitoring. Implement monitoring and alerting for all critical APIs.
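
Including certificate expiration in monitoring can be as simple as comparing a certificate's notAfter time with the current time and alerting well before it is reached. The following is a minimal sketch, not the monitoring actually deployed: the function names and the 14-day threshold are assumptions. It relies on `ssl.cert_time_to_seconds` from the Python standard library, which parses the date format returned by `ssl.SSLSocket.getpeercert()`.

```python
import ssl


def days_until_expiry(not_after: str, now: float) -> float:
    """Days left until a certificate's notAfter time.

    `not_after` uses the format found in ssl.SSLSocket.getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT"; `now` is seconds since the epoch.
    """
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def needs_renewal_alert(not_after: str, now: float, min_days: float = 14) -> bool:
    """True if the certificate expires within `min_days` (assumed threshold)."""
    return days_until_expiry(not_after, now) < min_days


# Illustrative values only: a check performed ten days before expiry.
not_after = "Jan  5 09:34:43 2018 GMT"
now = ssl.cert_time_to_seconds(not_after) - 10 * 86400
print(days_until_expiry(not_after, now), needs_renewal_alert(not_after, now))
```

A check like this would also catch the case in which an automatic renewal silently fails, because the remaining lifetime keeps shrinking instead of being reset.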

16.9.2024 edu-ID Login not working for some users

Date, Time, Duration	16.9.2024, 7:50, 2 hours
Severity	level 2 - service disruption
Incident Summary	On Monday morning at around 8:00, at the start of the fall semester for all Swiss universities, it was noticed that the edu-ID login failed or was delayed for some users, while for others it went through smoothly.
Affected users	Because not all users were affected and many users eventually managed to log in after a few attempts, it is difficult to estimate the number of affected users. However, about 200 additional support tickets, several phone calls and direct e-mails were received by the edu-ID team and the Switch front desk. It is estimated that several thousand users were affected.
Root cause analysis	Several factors played a role in this issue. Two of them were the high number of user logins due to the semester start (about 5x higher than in past weeks) and the increased usage of MFA. However, the decisive cause was a missing index on a database table, which in combination with the above led to very high CPU consumption. Even though load tests were performed on the internal MFA API before it was enabled in spring 2024, and even though the MFA API had been used for months without problems, this problem remained hidden until a massive number of logins by many different users triggered it.
Resolution and recovery	The creation of a database index immediately solved the issue.
Preventive measures, future actions and other learnings	Improve PostgreSQL SLIs to identify database congestion more quickly. Check for other missing indices in the database. Improve SLIs and alerting of other critical components where possible. Speed up notification of users: prepare messages that can quickly be published on the IdP and the load balancer. The incident was not displayed on https://status.switch.ch/ . Add "edu-ID login" to the status page in addition to account management. The internal incident management had to be improvised due to team personnel changes. Improve handover processes. Conduct trainings for successors. Update the incident management kit and checklist.
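
The effect of a missing index can be made visible by comparing a query plan before and after the index is created. The sketch below uses Python's built-in SQLite module purely as a stand-in for PostgreSQL, and the table `mfa_tokens`, its columns and the index name are hypothetical, since the report does not name the affected table. In PostgreSQL the equivalent check would use `EXPLAIN` on the slow query.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str) -> str:
    """Return the plan detail text SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return "; ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the one that lacked an index.
conn.execute(
    "CREATE TABLE mfa_tokens (id INTEGER PRIMARY KEY, user_id TEXT, secret TEXT)"
)
conn.executemany(
    "INSERT INTO mfa_tokens (user_id, secret) VALUES (?, ?)",
    [(f"user{i}", "x") for i in range(1000)],
)

lookup = "SELECT secret FROM mfa_tokens WHERE user_id = 'user42'"
plan_before = query_plan(conn, lookup)  # no index: full table scan
conn.execute("CREATE INDEX idx_mfa_tokens_user_id ON mfa_tokens (user_id)")
plan_after = query_plan(conn, lookup)   # index available: indexed search

print(plan_before)
print(plan_after)
```

A full scan per lookup costs little while few distinct users log in, but the cost grows with every concurrent login, which is consistent with the problem staying hidden until the semester start.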

Glossary and Classification

Term	Level	Definition
Minor	0	An unplanned interruption to a service or a reduction in the quality of a service, with low impact on users, services or organizations. Level 0 incidents are not publicly reported.
Disruption	1	Partial or short term disruption to services or compliance.
Emergency	2	Significant and widespread disruption of service or compliance; reputational damage; damage to individuals, including SWITCH staff.
Crisis	3	A situation with serious strategic or reputational damage, or where there is a credible risk to the life or health of individuals. Some incidents trigger crises.