Switch edu-ID Incident Reports

This page is a summary of past incidents. Level 1 incidents (disruptions) are published here within 7 days, and level 2 incidents (emergencies) within 3 days.

Current information on ongoing incidents can be found at https://status.eduid.ch

28.2.2025 IdP Outage

Date, Time, Duration 28.2.2025, 09:12, 10 minutes
Severity Level 1 - Service Disruption
Incident Summary

All users encountered a timeout message during the login process.

Affected users

Potentially every login during the time frame

Root cause analysis

Deployment of a faulty configuration on the load balancers led to the unavailability of most Identity Provider nodes. The nodes that remained active were not in a proper working state.

Resolution and recovery

The faulty configuration of the load balancers was rolled back as soon as our monitoring system sent out an alert and the mistake was noticed.

Preventive measures, future actions and other learnings
  • Prior coordination between involved teams before deploying new configuration

17.2.2025 Degraded Authentication Performance

Date, Time, Duration 17.2.2025, 10:10, 30 minutes
Severity level 2 - degraded performance
Incident Summary

Most users encountered a slow login process or saw a timeout message

Affected users

Potentially every login in the time frame

Root cause analysis

The combination of a temporary high load and a rolling upgrade of IdP nodes led to a degradation of the performance.

Only half of the IdP nodes were active after 10:00. The remaining nodes were not able to handle the incoming login requests in an acceptable amount of time.
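
The interaction between node count and peak load can be illustrated with a small capacity check. This is only a minimal sketch: the function name, the per-node throughput, the peak rates and the 20% headroom are all hypothetical values chosen for illustration, not measurements from the edu-ID infrastructure.

```python
import math


def max_nodes_in_maintenance(
    total_nodes: int,
    per_node_rps: float,
    peak_rps: float,
    headroom: float = 0.2,
) -> int:
    """Return how many nodes may be offline at once while the remaining
    nodes can still serve peak_rps plus a safety margin."""
    # Nodes needed to carry the expected peak, including headroom.
    required = math.ceil(peak_rps * (1 + headroom) / per_node_rps)
    return max(total_nodes - required, 0)


# Hypothetical figures: 6 nodes, each handling about 100 logins per second.
print(max_nodes_in_maintenance(6, 100, peak_rps=200))  # quiet period
print(max_nodes_in_maintenance(6, 100, peak_rps=400))  # temporary high load
```

With these made-up figures, three nodes could be upgraded at once during the quiet period, but only one during the high-load period. A check of this kind could run before each step of a rolling upgrade.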

Resolution and recovery

The overloaded nodes were disabled and traffic was switched over to the spare nodes. Since the request peak was already over at that time, the remaining nodes could handle the load. The previously overloaded nodes were enabled again after they had recovered.

Preventive measures, future actions and other learnings
  • Review our maintenance practice.
  • Consider adding/augmenting resources.
  • Enhance coordination of rolling release planning with the academic calendar.

12.2.2025 Partial MFA Outage

Date, Time, Duration 12.2.2025, 13:45, 20 minutes
Severity level 1 - service disruption
Incident Summary

Some users were not able to authenticate with MFA.

Affected users

About 40% of the users who needed to log in with a second factor or passkey during the incident were affected, and they had to retry later.

Root cause analysis

The IdP nodes authenticate to our internal APIs using secure mTLS authentication. The renewal of involved certificates led to authentication failures on two of five IdP nodes, due to a mismatch of machine identities in the API configurations. The configuration mismatch was caused by a coincidence involving the introduction of new IdP nodes and usage of a shared client certificate on all nodes.

Resolution and recovery

The IdP instances were rolled back to the previous certificate configuration, following the documented rollback strategy.

Preventive measures, future actions and other learnings
  • Improve the management of internal mTLS authentication, introduce centrally managed machine certificates
  • Improve error messages on the login services (the current messages misled end users into resetting MFA, which was not actually needed)
  • Further improve monitoring and alerting

21.1.2025 Account Registration Outage

Date, Time, Duration 21.1.2025, 9:00, 3 hours
Severity level 1 - service disruption
Incident Summary

edu-ID users could not create a new account.

Affected users

Around 200 users were affected, and they had to retry later

Root cause analysis

The underlying cause was a new version of PHP, which prevented the account management from writing changes back to the user database.

Resolution and recovery

Rolling back to the previous version of PHP.

Preventive measures, future actions and other learnings
  • Improve alerting: the e2e tests did not cause an alarm during automated testing or after deployment.

16.12.2024 Partial SMS Provider Outage

Date, Time, Duration 16.12.2024, 10:00, 30 minutes
Severity level 1 - service disruption
Incident Summary

Some edu-ID users (but not all) didn't receive SMS messages for around 30 minutes (10:00-10:30). The delivery failures occurred in bursts.

Affected users

This affected in particular the 2-step login for users without TOTP.

Root cause analysis

The SMS provider reported that it "experienced unexpected network issue", which caused some SMS not to be sent.

Resolution and recovery

The problems occurred in bursts and did not affect all users. Therefore, by the time a few users made the edu-ID team aware of the issue, it was already resolved. Also, due to a monitoring problem, the issue was not reported earlier.

Preventive measures, future actions and other learnings
  • Automate switchover to alternative provider in case of problems.
  • Monitor the monitoring system and ensure it is working properly all the time.

 

19.9.2024 SMS Provider Outage

Date, Time, Duration 19.9.2024, 10:03, 30 minutes
Severity level 2 - service disruption
Incident Summary

edu-ID users didn't receive SMS messages for around 30 minutes (10:00-10:30). This affected in particular the 2-step login for users without TOTP.

Affected users

In total, there were 1793 unsent SMSes (present without delivery report in the logs) between 10:00 and 10:36 on 19.09.2024. In the same time range of the previous day, there were only 23 undelivered SMSes.

These requests are associated with 621 different mobile numbers. Thus, we can conclude that about 620 users didn't get their requested SMS for mobile verification.
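
The estimate above amounts to counting log entries without a delivery report and then counting the distinct mobile numbers among them. A minimal sketch of that calculation follows, assuming a hypothetical `SmsLogEntry` record. The real log format is not described here, and the sample numbers are invented.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class SmsLogEntry:
    """Hypothetical log record: one requested SMS."""
    number: str
    requested_at: datetime
    delivered_at: Optional[datetime]  # None means no delivery report


def summarize_unsent(
    entries: Iterable[SmsLogEntry], start: datetime, end: datetime
) -> Tuple[int, int]:
    """Return (unsent SMS count, distinct affected numbers) in [start, end]."""
    unsent = [
        e for e in entries
        if start <= e.requested_at <= end and e.delivered_at is None
    ]
    return len(unsent), len({e.number for e in unsent})


# Invented sample data for illustration only.
sample = [
    SmsLogEntry("+41000000001", datetime(2024, 9, 19, 10, 5), None),
    SmsLogEntry("+41000000001", datetime(2024, 9, 19, 10, 7), None),
    SmsLogEntry("+41000000002", datetime(2024, 9, 19, 10, 20), None),
    SmsLogEntry("+41000000003", datetime(2024, 9, 19, 10, 21),
                datetime(2024, 9, 19, 10, 22)),
    SmsLogEntry("+41000000004", datetime(2024, 9, 19, 11, 30), None),
]
window = (datetime(2024, 9, 19, 10, 0), datetime(2024, 9, 19, 10, 36))
print(summarize_unsent(sample, *window))
```

Because users typically retry, the number of distinct mobile numbers is a better proxy for affected users than the raw count of unsent messages.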

Root cause analysis

The cause of the problem was congestion in the delivery queue of our primary SMS provider.

Resolution and recovery

We were able to switch over to our alternative SMS provider after half an hour, so that SMS messages could be sent again. The primary SMS provider solved the problem later. We switched back to the primary provider before noon.

Preventive measures, future actions and other learnings
  • Switching to the alternative SMS provider took too long. Streamline this process.
  • Alert was not sent to all edu-ID staff. Extend alerting channels.
  • The primary SMS provider was not aware of the issue. Switch needs a direct contact.
  • Collect separate SMS metrics for each provider.
  • Automate switchover to alternative provider in case of problems.
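
One way the last two measures could fit together is to drive the switchover from per-provider delivery metrics. The sketch below is an illustration under assumptions, not the implemented solution: the function name, the provider labels, the 20% threshold and the minimum sample size are all hypothetical.

```python
from __future__ import annotations


def choose_provider(
    stats: dict[str, tuple[int, int]],
    order: list[str],
    max_undelivered_ratio: float = 0.2,
    min_samples: int = 20,
) -> str:
    """Pick the first provider in `order` that looks healthy.

    `stats` maps a provider name to (sent, delivered) counts for a
    recent time window, collected separately for each provider.
    """
    for name in order:
        sent, delivered = stats.get(name, (0, 0))
        if sent < min_samples:
            # Too little data to judge; treat the provider as healthy.
            return name
        if (sent - delivered) / sent <= max_undelivered_ratio:
            return name
    # Every provider looks unhealthy: stay with the preferred one.
    return order[0]


# Invented window: the primary has delivered only 10 of 100 messages.
stats = {"primary": (100, 10), "alternative": (50, 49)}
print(choose_provider(stats, ["primary", "alternative"]))
```

Deciding on missing delivery reports rather than on send errors matters for an incident like this one, where the provider accepted messages but did not deliver them.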

 

18.9.2024 SMS and e-mails not sent via internal API

Date, Time, Duration 18.9.2024, 2:00, 7.5 hours
Severity level 2 - service disruption
Incident Summary

No SMS were sent to edu-ID users via an internal API. This affected in particular the 2-step login for users without TOTP. Also, no e-mails were sent via the same API, e.g. to reset passwords. TOTP authentication was not affected.

Affected users

In total, requests from 2687 different users failed. Thus, we can conclude that almost 2700 users saw at least one error during MFA login with SMS or mobile/email verification.

Root cause analysis

The problem was that the communication between two internal APIs failed due to an expired X.509 certificate whose automatic renewal failed.

Resolution and recovery

The acme-cert-renewal services for the client certificates were restarted manually on all nodes of the internal API.

Preventive measures, future actions and other learnings
  • Verify and clean up deployment of all APIs.
  • Include certificate expiration/renewal in monitoring.
  • Implement monitoring and alerting for all critical APIs.
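
A certificate expiration check of the kind mentioned above can be quite small. The following is a minimal sketch, not the monitoring that was actually deployed: the function names and the 14-day warning threshold are assumptions. It expects the `notAfter` string in the format that Python's `ssl.SSLSocket.getpeercert()` returns and parses it with `ssl.cert_time_to_seconds`.

```python
import ssl
from datetime import datetime, timedelta, timezone


def time_until_expiry(not_after: str, now: datetime) -> timedelta:
    """Time left until the certificate's notAfter date (negative if expired)."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return expires - now


def needs_renewal_alert(
    not_after: str,
    now: datetime,
    warn_before: timedelta = timedelta(days=14),  # hypothetical threshold
) -> bool:
    """True if the certificate has expired or expires within warn_before."""
    return time_until_expiry(not_after, now) <= warn_before


# Invented example: a certificate expiring ten days after the check.
now = datetime(2024, 9, 8, tzinfo=timezone.utc)
print(needs_renewal_alert("Sep 18 00:00:00 2024 GMT", now))
```

Alerting on the remaining lifetime, rather than on the renewal job's exit status, also catches the case where automatic renewal fails silently.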

 

16.9.2024 edu-ID Login not working for some users

Date, Time, Duration 16.9.2024, 7:50, 2 hours
Severity level 2 - service disruption
Incident Summary

On Monday morning at around 8:00, at the start of the fall semester for all Swiss universities, it was noticed that the edu-ID login failed or was delayed for some users, while for others it went through smoothly.

Affected users

Because not all users were affected and many users eventually managed to log in after a few attempts, it is difficult to estimate the number of affected users. However, about 200 additional support tickets, several phone calls and direct emails were received by the edu-ID team and the Switch front desk. It is estimated that several thousand users were affected.

Root cause analysis

Several factors played a role in this issue, among them the large number of user logins (about 5x higher than in past weeks) due to the semester start and the increased usage of MFA. However, the decisive cause was a missing index on a database table: without it, queries on that table consumed a lot of CPU under the above load.

Even though load tests were performed on the internal MFA API before it was enabled in Spring 2024 and even though the MFA API has been used for months without problems, this problem remained hidden until a massive number of logins by many different users triggered it.

Resolution and recovery

The creation of a database index immediately solved the issue.
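
The effect of a missing index can be reproduced on a small scale. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the table and index names are hypothetical. It only shows how the query plan changes from a full table scan to an index lookup once the index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the real schema is not described in this report.
conn.execute(
    "CREATE TABLE mfa_tokens (id INTEGER PRIMARY KEY, user_id INTEGER, secret TEXT)"
)
conn.executemany(
    "INSERT INTO mfa_tokens (user_id, secret) VALUES (?, ?)",
    [(i % 1000, "x") for i in range(10000)],
)


def query_plan(connection: sqlite3.Connection, sql: str) -> str:
    """Return SQLite's plan description for the given statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row holds the human-readable detail.
    return " ".join(row[3] for row in rows)


query = "SELECT secret FROM mfa_tokens WHERE user_id = 42"
before = query_plan(conn, query)  # full scan of the table
conn.execute("CREATE INDEX idx_mfa_tokens_user_id ON mfa_tokens (user_id)")
after = query_plan(conn, query)   # lookup via the new index
print(before)
print(after)
```

A full scan costs little while the table is small or logins are few, which is consistent with the problem staying hidden until many different users logged in at once. In PostgreSQL, `EXPLAIN` provides the comparable plan output.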

Preventive measures, future actions and other learnings
  • Improve PostgreSQL SLIs to identify database congestion more quickly
  • Check for other missing indices in database
  • Improve SLIs and alerting of other critical components where possible
  • Speed up notification of users. Prepare messages that can quickly be published on the IdP and the load balancer
  • The incident was not displayed on https://status.switch.ch/. Add "edu-ID login" to status page in addition to account management.
  • The internal incident management had to be improvised due to team personnel changes. Improve handover processes. Conduct trainings for successors. Update incident management kit and checklist.

 Glossary and Classification

| Term | Level | Definition |
|---|---|---|
| Minor | 0 | An unplanned interruption to a service or a reduction in the quality of a service, with low impact on users, services or organizations. Level 0 incidents are not publicly reported. |
| Disruption | 1 | Partial or short term disruption to services or compliance. |
| Emergency | 2 | Significant and widespread disruption of service or compliance; reputational damage; damage to individuals, including SWITCH staff. |
| Crisis | 3 | A situation with serious strategic or reputational damage or where there is a credible risk to life or health of individuals. Some incidents trigger crises. |