The current state of SWITCHengines outages is available at https://switchengines-1564678076272.site24x7signals.com/
Chronological List of Past Outages
2020-05-11 - 2020-06-30 SWITCHengines Storage Incident
On May 11 we observed stability problems with the Ceph storage cluster in Lausanne. Increasing numbers of processes (OSDs) which handle replication, rebalancing and recovery of data were being killed to handle out of memory issues. We restored service, but after several hours, processes died again. From May 14-May 26th all hosts in the cluster were upgraded with additional RAM which improved the situation temporarily. This pattern continued as long as very large buckets of object data were present on the cluster. In parallel to running ongoing compaction of metadata databases, and deleting the problematic data, additional disks were added to the cluster starting from June 6th to increase available processes. Finally, between June 17th-June 30th the cluster was re-architected with an additional 6 servers dedicated to block/volume storage to isolate those use cases from performance problems, and the remainder of the problematic data was deleted.
During the worst periods (the weeks of May 11, May 25, June 1, June 8) both volume and object users were severely impacted. For certain periods during the week of May 11, the cluster was completely unavailable for several hours at a time. For the other critical periods, while for a significant part of this time less than 5% of the cluster placement groups were unavailable, if even part of the data required to mount a volume or access an object was impacted, the full operation would hang or fail so there was a negative customer impact greatly in excess of the simple reported availability.
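As a rough illustration of why a small fraction of unavailable placement groups had such a disproportionate effect, the sketch below assumes that an operation touches a number of placement groups chosen independently, and that it hangs if any one of them is unavailable. The independence assumption, the function name and the example group counts are ours for illustration; they are not measurements from the incident.

```python
def blocked_probability(unavailable_fraction: float, groups_touched: int) -> float:
    """Chance that at least one of the touched placement groups is unavailable.

    Assumes each touched group is unavailable independently with the same
    probability, which is a simplification of real data placement.
    """
    return 1.0 - (1.0 - unavailable_fraction) ** groups_touched


# 5% is the upper bound quoted in the text; the group counts are hypothetical.
for n in (1, 10, 50, 100):
    chance = blocked_probability(0.05, n)
    print(f"{n:>3} placement groups touched: {chance:.1%} chance of a hang")
```

Under these assumptions, an operation that touches ten placement groups is blocked roughly 40% of the time even though only 5% of the groups are unavailable.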
For certain S3 use cases, access was explicitly disabled from May 18th, and their services were removed from the cluster and placed on alternative infrastructure. In addition, the stored data on the cluster for these use cases was deleted with their agreement from that time.
For all users of the cluster, from the week of May 25th until the end of June, all operations on the cluster were rate limited in order to aid recovery by controlling the amount of data placement operations needed.
For a subset of users from June 8th, performance degradation caused by necessary deletion operations required migration or dual-homing to Zürich.
On the weekend prior to May 11th, a disk had failed in the cluster which triggered a routine rebalancing operation where replicated data gets a 3rd copy made and placed, and erasure coded data gets a recalculated data chunk placed to restore resilience. This operation caused OSD processes to grow, filling up the memory on some of the servers they were running on, and causing the cluster to grind to a halt. OSDs which were dealing with the Erasure Coding (EC) pool and the giant buckets using this pool were particularly resource hungry.
The buckets in question had begun to be used in 2019, but increasingly in 2020 with a use case based on Veeam backup. Over 850 million of the almost 1 billion objects in the Lausanne cluster were in only 10 buckets. In addition, the object sizes were very small, the largest bucket having 1MB objects and others at most 4MB objects. Due to erasure coding with an 8+3 configuration, each of these objects was then divided up into 11 chunks before being placed, which is a resource hungry operation. This put significant pressure on the metadata databases for the OSDs which then spilled out onto HDD rather than SSD for the majority of the operations needed. This made the needed operations slower and contributed to load. Individual placements of data then became shorter and shorter due to resource pressure, which created a loop in which the metadata databases grew even more, increasing the resource pressure in turn.
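The arithmetic behind this pressure can be sketched as follows. The object count and the 8+3 profile come from the text above; treating "1MB" as 1 MiB, ignoring Ceph's internal striping and allocation details, and the function names are our simplifications.

```python
DATA_CHUNKS = 8     # k in the 8+3 erasure coding profile
PARITY_CHUNKS = 3   # m in the 8+3 erasure coding profile
REPLICAS = 3        # copies kept under 3x replication


def ec_pieces(objects: int) -> int:
    """Number of chunks to place when every object is erasure coded 8+3."""
    return objects * (DATA_CHUNKS + PARITY_CHUNKS)


def replicated_pieces(objects: int) -> int:
    """Number of copies to place when every object is replicated 3x."""
    return objects * REPLICAS


def ec_chunk_bytes(object_bytes: int) -> float:
    """Approximate size of one data chunk (simplified: object size / k)."""
    return object_bytes / DATA_CHUNKS


objects_in_large_buckets = 850_000_000
print(ec_pieces(objects_in_large_buckets))          # pieces to track with 8+3 EC
print(replicated_pieces(objects_in_large_buckets))  # pieces to track with 3x replication
print(ec_chunk_bytes(1024 * 1024))                  # bytes per data chunk for a 1 MiB object
print((DATA_CHUNKS + PARITY_CHUNKS) / DATA_CHUNKS)  # raw-space overhead factor of 8+3
```

In this simplified view, erasure coding saves raw disk space (a factor of 1.375 instead of 3) but multiplies the number of small pieces that the OSD metadata databases have to track.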
Resolution and mitigation required several steps in parallel.
Continuous rolling trimming of logs and compaction of metadata databases for the OSD processes so that the HDD use was kept as low as possible and resource required minimised, including weekends and evenings.
Rate limiting of IOPS to control the load on the cluster such that resources did not encounter a ‘thundering herd’ effect each time more processes became available, which would trigger a failure again.
Addition of RAM, disks and additional servers to spread the load
Refactoring of the cluster to separate out different use cases and isolate them from issues with erasure coding and object storage
Deletion of the problematic large buckets
Full clean and recompaction of over 600 OSD databases after the deletion of data.
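The rate limiting step above can be illustrated with a token bucket, a common general-purpose technique for capping the rate at which operations are admitted. This is a minimal sketch only: the text does not say which mechanism was used on the cluster, and the class name, rates and burst size below are hypothetical.

```python
class TokenBucket:
    """Admit operations at a sustained rate, with a bounded burst."""

    def __init__(self, rate_per_second: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate_per_second   # tokens added per second
        self.burst = burst            # maximum tokens that can accumulate
        self.tokens = burst           # start with a full bucket
        self.last = now               # time of the previous refill

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if an operation of the given cost may proceed at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A "thundering herd": 1000 requests arrive at the same instant.
bucket = TokenBucket(rate_per_second=100.0, burst=10.0)
admitted = sum(bucket.allow(now=0.0) for _ in range(1000))
print(f"admitted {admitted} of 1000 simultaneous requests")
```

Only the burst allowance is admitted at once; the remaining requests have to wait for tokens to refill, which spreads the load over time instead of letting it hit recovering processes all at once.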
Lessons Learned & Next Steps
The hardware resource consumption implications for erasure coding in combination with the Veeam use case was not detected during the testing period and the original hardware for a replication based infrastructure was therefore underpowered.
The original design for SWITCHengines storage as a multi-purpose cluster protected well against small, routine failures and gave a lot of flexibility, but in the case of full catastrophe made it possible for an issue with a small subset of users to impact all users on the cluster. The cluster in Lausanne has been fully re-architected, similar use cases have been removed from Zürich, and the strategy for delivering those services on an ongoing basis is being redeveloped for implementation in Q3/Q4.
While the team’s expertise is at a very high level for Ceph operations and they correctly identified the root issue very quickly, we benefited during the incident from an additional critical response contract with Ceph specialists. They supported us for development and review of mitigation and restore measures, and following up with Ceph on related bugfixes. Following on from that we will be working with them on an evaluation of both our clusters to optimise configurations from a stability perspective.
SWITCH will continue to support all current use cases. The Veeam use cases that generated the large numbers of small objects (the root cause for this incident) are now using another storage system with a revised recommended configuration on both Veeam and infrastructure side to support the needed number of small objects. SWITCH had already planned a reassessment of storage strategy for the coming 3-5 years. This was expedited to take place in Q3 and will take the learnings from this incident into account.
Appendix: About Ceph and SWITCHengines
Ceph is an application that delivers object, block and file storage in one unified system. It is developed and managed as an active, well governed Open Source project under the Linux Foundation with strong hardware and software industry contributions (Canonical, RedHat, SUSE, Samsung, Intel, Western Digital). Over 50% of deployments are in commercial environments, and the average deployment is around 3 PB, with 2-5 clusters and more than 50 nodes. SWITCHengines fits this deployment profile very well. Most deployments keep to the latest release, use commodity hardware, HDD and Bluestore database, and use it mostly to support volume storage and integrate with OpenStack. This is also the context in which SWITCHengines selected and deployed Ceph. Over 6 years of use the platform has been very reliable for us and experienced no data loss.
The architecture of Ceph is scale out and distributed. At the top level are interfaces serving API access to object and block storage. Within the cluster, a number of Ceph Monitor processes run which maintain maps of the cluster state and components. Finally, at the lowest level, Ceph Object Storage Daemons (OSDs) handle replication, rebalancing and recovery. Typically there is one of these per individual drive.
SWITCHengines operates two Ceph clusters, one in Lausanne with < 50 nodes, and one in Zürich with < 100 nodes. The initial deployment implemented 3x replication for data and served primarily block storage, with some object storage also being used. The typical number of OSD processes for the Lausanne cluster to run was over 500. These processes determine how data is placed within the cluster based on the chosen data protection algorithm and available disk.
In 2019 SWITCH deployed the Ceph recommended new internal database, Bluestore, which improved the efficiency of data placement by removing the need for an additional XFS filesystem and upgraded to the current Nautilus release. Erasure coding was also deployed on a part of the clusters to support new backup use cases. In 2020 these use cases began actively shipping large quantities of data to the Lausanne cluster.
2020-05-12-2020-05-15 Storage Outage Lausanne
This is a technical postmortem on the outage to SWITCHengines between 12-15th May when extreme pressure on available memory caused an extended period of unavailability impacting both object storage and VMs in Lausanne.
On May 11 we observed stability problems with the Ceph (1) storage cluster in Lausanne. Increasing numbers of OSD (2) processes were being killed to handle out of memory issues. On the weekend prior to May 11th, a disk had failed in the cluster which triggered a routine rebalancing operation. This operation caused OSD processes to grow, filling up the memory on some of the servers they were running on, and causing the cluster to grind to a halt. OSDs which were dealing with the Erasure Coding (EC) (3) pool and the giant buckets using this pool were particularly resource hungry.
We managed to stabilise the processes by individually bringing them offline and back on again, but by the evening of May 12th the situation degraded significantly until 100s of processes were affected. Most operations on the cluster were blocked, impacting object storage and VMs.
On May 13th we brought the cluster up without S3 (4) for partial restoration of service as the source of the problems was related to an S3 EC use. In parallel, we worked together with our colleagues in the Ceph community to develop a strategy to relieve pressure on RAM and also ordered a RAM upgrade for the cluster. Over the course of the day we were able to bring most of the affected OSDs online, and restore S3 access. However, although this meant only a small amount of data was unavailable, the impact was wider as the data needed for each operation is distributed and a small subset being unavailable can block a full volume, for example. In parallel, we were able to source the RAM needed to double the capacity in 32 affected servers.
On May 14th we began deploying the measures identified with the Ceph community. Placement group (5) logs were trimmed and this was particularly effective as a mitigation. By 18:00, all data was available.
On May 15th, RAM was replaced throughout the day. Due to restrictions on travel/availability of overnight accommodation related to the current COVID-19 situation, the 16 largest nodes were upgraded with the remaining 16 to be scheduled to 22 May and 26 May.
Our learnings from this incident indicate that a combination of 8+3 EC pools, and buckets in those pools with hundreds of millions of very small objects, pushed the cluster beyond safe operations for its specification. This use of the cluster is a factor of 50 more intensive for managing placement of data and other operations compared to our typical use case. As well as the mitigation already deployed, we will be looking at other ways of supporting these use cases and continuing to upgrade the remaining 16 nodes.
(2) OSD: Ceph processes that manage storing data throughout the cluster and providing access to them for higher level operations
(3) Erasure Coding. In erasure coding, the data is broken into parts, then expanded and encoded. After that the data segments are kept in multiple locations. In an 8+3 configuration, data is split into 8 fragments and an additional 3 parity fragments are created for protection.
(4) S3: API for object storage
(5) Placement groups: Ceph function that supports data distribution
2020-05-11 Storage issue Lausanne
On May 11 we had stability problems with the storage cluster in Lausanne. During the day, we were able to mitigate the worst of the effects. However, between 18:00 and 21:30 it was necessary to take the cluster down for emergency repair without notification. The issue was related to memory pressure that was higher than we were used to, so it was necessary to tune the systems accordingly. We were also hit by a bug that can occur under memory pressure, which we mitigated with a newer kernel. We continue to monitor the situation.
2020-04-02 MTU issue Lausanne
SWITCHengines Lausanne experienced a partial loss of control plane connectivity between 13:40 on Thursday April 2 and 08:25 on Friday April 3. Running VMs were unaffected and available normally, as was control plane access via IPv6 and certain IPv4 paths.
During a routine operation, non production ports on L2 infrastructure were configured. This caused more ports on the bridge to default to a 1500 byte MTU and a subset of traffic was blocked. Resetting the MTU size did not recover the path. Complex troubleshooting took place throughout the day to isolate the issue which was difficult to reproduce, and various workarounds were attempted unsuccessfully.
At 08:20 the following day, the return route to the control plane nodes was temporarily diverted from the affected path and at 8:45 a hard reboot of the Cumulus switches restored stable operations and the workaround was removed.
2020-03-25 Control Plane Outage Zürich
Control plane functionality for SWITCHengines ZH was unavailable on 25.03.2020/26.03.2020 between 20:00 and 02:00. During this time it was not possible to create or make changes to VMs in that region. Running VMs were available as normal.
As part of routine clean up work of obsolete control plane infrastructure, at 20:00 the deletion of an old master node caused the failure of a number of other nodes and put pressure on the message queueing system which needed to be recovered.
The pods running these nodes were recovered between 20:15 and 21:50. The recovery of the message queuing system was completed by 02:00 on 26.03.2020.
2020-03-16 Control Plane and Provisioning Service Outage Zürich
On 16.3.2020 at 21:12, monitoring detected that the OpenStack compute service in Zürich was unavailable. Troubleshooting restored main functions within an hour, and the remaining operations were all stable by 22:38. Running VMs were available throughout the issue. In order to address any further concerns, the measures already deployed to Lausanne during the week to simplify control plane networking will be deployed to Zürich ASAP.
2020-03-03 Control Plane Outage Zürich
On 3.3.2020 between 08:30 and 14:30 failures occurred on the control plane for SWITCHengines making it impossible to create new VMs and similar administrative operations. Existing running VMs were still operational.
As part of an ongoing project to improve network stability for SWITCHengines, scheduled maintenance took place in Zürich to move parts of the management network from Brocades which are reaching end of life to Cumulus switches which are the standard maintained platform. The maintenance window work was completed between 06:00-08:30 and had an expected break in connectivity which was restored successfully.
Shortly after the completion of the work, it was observed that the message queuing component, RabbitMQ was experiencing long queues and some OpenStack control plane agents were not responsive. When restarting RabbitMQ did not restore service, further investigations identified that packets were being fragmented between some hosts, suggesting an MTU issue. By 11:00 all affected hosts with this issue had been identified and mitigation for this problem deployed.
Once the MTU issue was under control, troubleshooting moved to the RabbitMQ service and the Oslo messaging service within OpenStack which was experiencing a cascade effect from the MTU interruption. During this troubleshooting, control plane services were systematically restarted, starting with lowest potential customer impact first. The strategy was to address the load issue and avoid a ‘stampeding herd’ effect and to flush the queues of potential corrupt messages. The measures were finally effective by 14:30.
Although the MTU problem was the most probable starting condition, a bug in the Oslo messaging service of OpenStack that causes daemons to not properly reconnect to the correct queue in RabbitMQ after a connection loss is strongly suspected to have contributed to the long queues on RabbitMQ. Therefore, even if the MTU issue does not occur in the equivalent maintenance in Lausanne, the risk level of the work was reevaluated and the maintenance window rescheduled to permit some preventative measures to be put in place.
A full OpenStack upgrade would be required to fix the Oslo bug. This cannot safely be scheduled before the next network maintenance in Lausanne, so alternative mitigation has been identified and planned. These measures include:
Further simplifying control plane networking by using internal networks rather than routing via ingress points, reducing the impact of changes
Actively profiling the detailed network conditions to identify any potential MTU issues
Learning from this issue the most effective sequence of control plane interventions should the issue recur so that they can be deployed more rapidly.
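One way to reason about MTU profiling is to work out the largest probe that a path should carry without fragmentation. The sketch below assumes IPv4 without header options and an ICMP echo probe; the hop MTU values (including the 9000 byte jumbo frame size) and the function names are hypothetical and not taken from the SWITCHengines network.

```python
from typing import Iterable

IPV4_HEADER_BYTES = 20  # IPv4 header without options
ICMP_HEADER_BYTES = 8   # ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES


def probe_fits(payload_bytes: int, hop_mtus: Iterable[int]) -> bool:
    """True if a probe with this payload fits through the smallest MTU on the path."""
    return payload_bytes <= max_ping_payload(min(hop_mtus))


# Hypothetical path where one hop has fallen back to a 1500 byte MTU.
path = [9000, 1500, 9000]
print(max_ping_payload(1500))
print(probe_fits(max_ping_payload(9000), path))  # probe sized for jumbo frames
print(probe_fits(max_ping_payload(1500), path))  # probe sized for the smallest hop
```

Sending probes of these sizes with fragmentation disallowed, and noting which ones fail, narrows down whether some hop on a path has a smaller MTU than expected.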
2019-08-13 - Problems after Ocata Upgrade in LS
SWITCHengines LS was upgraded to Ocata. We encountered some issues that had impact on operations:
- The upgrade of the network nodes took a lot longer than expected (due to upgrades of the operating system). The network nodes are not in containers and will move to dedicated hardware in the future. That, coupled with a problem in our internal firewall rules, caused a 53 minute long disruption of network traffic to a subset of virtual machines in LS around noon yesterday. We sincerely apologise for this. In order to prevent a similar incident in ZH, we are doing the preliminary package upgrades to the network nodes in our regular Tuesday morning maintenance window (2019-08-20). Because we have redundant network nodes we don’t expect any network interruptions.
- Access to the OpenStack API endpoints was slow or interrupted yesterday afternoon until earlier this morning for clients that connected via IPv4. The fundamental problem was due to a failing switch (which caused packets larger than 1500 bytes to be dropped).
- Access to authenticated S3 buckets was impacted until 20:15 yesterday - we had to move Keystone (the identity service) from LS to ZH to fix it. We are still investigating the root cause.
2019-06-12 11:00 - 2019-06-13 19:30 - Problems with S3 storage in LS
There is a post mortem document for all three outages on 2019-06-12
A software problem with the underlying Ceph storage makes access to certain buckets hosted on our RadosGW (S3 compatible storage) impossible. We are working with engineers from RedHat on diagnosing and fixing the problem. 31 buckets are currently inaccessible.
The problem has been fixed and service is restored. No data loss occurred.
2019-06-12 11:15 - 2019-06-12 13:00 - Outage on control plane ZH
From 11:15 an issue affecting control plane communication in SWITCHengines ZH was identified. The cause was attributed to communications/networking causing issues for RabbitMQ. This was an unexpected side effect of an MTU issue during scheduled maintenance. It was fixed by relieving pressure on the message queue and restarting Neutron, the networking service. The issue was stabilised and service restored at 13:00.
2019-06-12 13:30 - 2019-06-12 13:57 - Partial outage of network connectivity in ZH
A crashing network node (which carries traffic from VMs to the internet) took down connectivity for all virtual routers hosted on that network node. SWITCHengines ZH has 3 network nodes, so roughly a third of the VMs were affected. We migrated the virtual routers to the other network nodes, which restored connectivity.
2018-11-26 15:20 - 15:47 - Loss of network connectivity in ZH to some VMs
Due to a crashed process on one of our infrastructure networking nodes, a portion of VMs running in the ZH region lost their in and outbound network connectivity.
After a reboot of the affected component, network access was restored.
At 15:20:55 the software on one of our three network nodes crashed. This led to around 150 virtual routers dying. All VMs attached to these routers lost connectivity. A regular restart of the software did not solve the problem, so we rebooted the component at 15:38. After the software restarts, it has to rebuild the virtual network infrastructure. This process takes a couple of minutes and connectivity was fully restored at 15:47.
2018-10-08 11:00–2018-10-28 00:55: ZH/LS: Increased latency on virtual disks attached to VMs
Starting on 2018-10-10, several customers complained about extreme latency problems, leading to blocked VMs and services, on both the LS and the ZH Ceph clusters. We didn’t see those problems ourselves, but have had emails from our customers in various states of agitation.
Initially, we treated the individual tickets as isolated incidents. In some cases (gitlab.customer1.ch), we rebooted customer VMs to resolve the blockage, which would reappear one or a few days later. Another customer (customer2) shut down their machine for weeks because they found that their application—Veeam backups to a VM—had effectively stopped working.
Attempts to address the issue based on a sketchy picture of the problems
At this stage, we suspected that the latency issues were related to ongoing work on the Ceph clusters, in particular the upgrade of storage servers (OSDs) from the older “FileStore” to the newer “BlueStore” format. Our approach is to upgrade individual servers at a rate of about one per day in each cluster. The process of filling an upgraded server’s disks with data causes a large amount of traffic within the cluster, and we thought that this load might be the reason for the high latency perceived by some of our users. We tried to address that by slowing down the migration process. Eventually we stopped the upgrade process altogether.
But the problems continued. We couldn’t even assess the extent of the issue: Only a few users complained about them, mostly those who have VMs with one or several RBD (virtual disk) volumes of a Terabyte or more.
Start of intense debugging work
On 19 October, a crisis was declared, and the team as a whole tasked with (among other things) identifying the cause or causes of the ongoing problems in the clusters, and ideally solving them. In addition, we were supposed to develop a performance indicator that could be used by us to notice such issues before our customers call us.
The team first worked on latency in general and entered into discussions about what latency should be expected. In particular, the only latency indicator that we do have—the “Cluster I/O Latency” graphs in the Ceph overview dashboards in Grafana—seemed to show a gradual increase of about 30% over the past few months. This turned out to have nothing to do with the problem that caused our users grief, but we spent a lot of time and effort interpreting and trying to explain these measurements. A beneficial outcome is that we now better understand what these graphs measure, and what they fail to see.
Towards reproducibility of the problem
The gitlab.customer1.ch hangs provided us with opportunities to debug the issues, because that was the only machine that we could access and that exhibited the problems. The problem was that it also provides an important service for CUSTOMER1, and at first we had no opportunity to do any diagnosis, because we had to restore service (by rebooting the VM), and that made the problem go away for the next day or couple of days. Later we separated the backup storage into a separate filesystem on a separate RBD disk. This gave us some time for diagnosing the issues when they were isolated to writing or reading backups.
On 25 October, we wrote a Python script in an attempt to reproduce the problem ourselves. The initial hypothesis was that the VMs were running into some limitation when large RBD disks are used, and the Ceph client in the VM monitor (Qemu) has to create many threads and many TCP sockets to OSDs. The script tried to quickly provoke such a situation by writing short strings to large (first 1TB, then two 2TB) RBD volumes via raw device (e.g. /dev/vdb) access. Between the short writes, the script skipped over 4MiB blocks to touch a maximum of Rados objects—each RBD volume/image is split into many 4MiB objects.
The script failed to reproduce the problem, because the hypothesis was wrong. We then increased the size of the writes-per-4MiB-page from a few bytes, first to 64KiB, and then to the full 4MiB, at which the script continuously wrote over the entire volume. At that point, the script would hit the issue: After some time (often many minutes or even hours), a 4MiB write request would suddenly take 27 minutes instead of the usual fraction of a second. This was the first time we could reproduce the problem with a “synthetic” workload without getting any users harmed.
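The original script is not reproduced in this report. As a rough sketch of the approach described above, the following writes a fixed number of bytes at the start of every 4 MiB block and times each write. The function name, the use of `os.fsync` to force each write out, and the scratch file used for the demonstration are our assumptions; the original test wrote to raw RBD devices such as /dev/vdb. It relies on `os.pwrite`, so it assumes a Unix-like system.

```python
import os
import tempfile
import time

OBJECT_SIZE = 4 * 1024 * 1024  # each RBD image is split into 4 MiB objects


def strided_write_latencies(path: str, span_bytes: int, write_size: int) -> list:
    """Write `write_size` bytes at the start of each 4 MiB block in the first
    `span_bytes` of `path` and return the time taken by each write in seconds."""
    payload = b"x" * write_size
    latencies = []
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset in range(0, span_bytes, OBJECT_SIZE):
            start = time.monotonic()
            os.pwrite(fd, payload, offset)
            os.fsync(fd)  # push the write to the backing store before stopping the clock
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
    return latencies


# Demonstration against a 16 MiB scratch file instead of a real block device.
with tempfile.NamedTemporaryFile() as scratch:
    scratch.truncate(4 * OBJECT_SIZE)
    scratch.flush()
    latencies = strided_write_latencies(scratch.name, 4 * OBJECT_SIZE, 64 * 1024)
    scratch.seek(OBJECT_SIZE)
    sample = scratch.read(4)  # confirm the second block was written

print(f"{len(latencies)} writes, slowest took {max(latencies):.4f} s")
```

Setting `write_size` to `OBJECT_SIZE` corresponds to the final variant described above, in which the script wrote continuously over the entire volume; an outlier in the returned latencies would then correspond to a stalled write.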
TCP bug in a version of our Linux kernel
On the next day, we were able to analyze these situations in more detail by tracing the traffic of the VM (in a “slow volume” state). After lengthy debugging sessions, we found that the TCP traffic from the writing VM—the sender—to an OSD—the receiver—was limited to one 512-byte TCP segment every 200 milliseconds. We looked for possible reasons for the slow progress, but couldn’t find any; both sender and receiver had ample buffer space, and we couldn’t find any other bottleneck.
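The observed segment rate is consistent with the 27 minute stalls seen by the test script, as this back-of-the-envelope calculation shows. It assumes the limit of one 512-byte segment per 200 milliseconds applied for the whole 4 MiB write and ignores protocol overhead.

```python
SEGMENT_BYTES = 512              # payload carried by each TCP segment
SEGMENT_INTERVAL_S = 0.2         # one segment every 200 milliseconds
WRITE_BYTES = 4 * 1024 * 1024    # one 4 MiB write request

segments = WRITE_BYTES // SEGMENT_BYTES
duration_s = segments * SEGMENT_INTERVAL_S
duration_min = duration_s / 60
throughput_bytes_per_s = SEGMENT_BYTES / SEGMENT_INTERVAL_S

print(f"{segments} segments at {throughput_bytes_per_s:.0f} bytes/s "
      f"take {duration_min:.1f} minutes")
```

At roughly 2.5 KiB per second, a single 4 MiB write needs 8192 segments and a little over 27 minutes to complete.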
This mystery was resolved on the next day (2018-10-27, a Saturday), when another team member found Ubuntu bug 1796895, which fit that symptom quite well. We found that the buggy kernel version was present on 9 storage servers and 1 compute node. We quickly upgraded the kernels on those servers, and the problem hasn’t occurred since then.
Provenance of the kernel TCP bug
We wanted to know more about the kernel bug, and found that it was never present in the “upstream” Linux kernel. It was introduced in the LTS Linux kernel when two related patches—submitted to the upstream kernel maintainers in the space of a second—were separately backported from upstream to stable, with several weeks between them. The (buggy) version of the LTS kernel was available for those several weeks, and it was picked up by the Ubuntu kernel maintainers later, and published as official Ubuntu Linux kernels.
The separation of these patches by the LTS kernel maintainers is what created the bug. It seems promising to work on improvements in the Linux kernel maintenance process to reduce the possibility of this type of problem occurring in the future. We intend to bring this bug, and its annoying consequences for us, to the attention of the kernel maintainer community in a friendly manner to encourage reflection on such process improvements.
The proper debugging of issues that are affecting our users’ ability to work should be given priority over everything else. Effort spent on the thorough debugging of such issues should be encouraged and rewarded rather than seen as a distraction from planned work.
When users notify us about performance problems, we must expend effort to truly understand their observations. We cannot expect users to characterize issues in our terms, so we must ask directed questions so that we can get a good picture. Unless/until the user has confirmed that the problem has gone away, we should assume that it still exists. If we cannot reproduce the symptoms ourselves, we should ask the user for access to the affected systems—they will usually be happy to grant it if they believe that we’ll make efforts to solve their issue.
Performance debugging is hard. We need to improve our analytical team skills as well as our understanding of the overall system, so that we can reason effectively about possible causes.
In particular, we need to get better at testing hypotheses using reflection/discussion and experiments. This should reduce the time and effort expended on side-tracks with limited impact on user-perceived performance.
Actions to take
We took adaptive actions after the problem was identified. In addition these are further actions to take:
- Make Upstream Linux developers aware of the problem and how it could have been prevented (combining the two separate merge requests, commenting that they should be applied together)
- Don’t upgrade components (in this case the Ceph BlueStore migration) in two locations at the same time
- Only install kernels that have been installed and vetted in the staging environment
- Training engineering team on deductive reasoning in complex systems
- Review metrics / dashboards to see if they measure what they say they measure
06.09.2018, 13:06–13:23: ZH Region: Multiple interruptions of connectivity to many VMs
Between 13:06:07 and 13:23:12 (MET DST), connectivity was interrupted for multiple short periods of time (seconds to minutes) for SWITCHengines VMs on some, but not all, tenant networks in the ZH region. The longest continuous interruption was between 13:06:07 and 13:11:23.
Reason for outage: We had to modify some access lists on some hosts as part of the process of extending the infrastructure with new servers. In the process, we created a duplicate clause within the access list. In principle this would have been harmless, because the second clause would simply have been ignored. But unexpectedly to us, a daemon that is part of the software-defined networking part of SWITCHengines (OpenStack Neutron) detected the duplicate rule and "helpfully" deleted the first occurrence. That caused legitimate traffic to be dropped by a subsequent deny-all rule; in particular, all external traffic for tenant networks whose router happened to be running on that particular host (about a third of the tenant networks in the ZH region). At first we re-inserted the deleted rule manually a few times before understanding the underlying issue. Since we deleted the second occurrence at 13:23, connectivity has been stable again. We apologize for the inconvenience.
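The failure mode can be illustrated with a toy first-match access list. This is a simplified model under our own assumptions: the rule names are hypothetical, the duplicate is assumed to sit after the deny-all rule, and the clean-up function only mimics the behaviour described above rather than reproducing any Neutron code.

```python
from typing import List, Tuple

Rule = Tuple[str, str]  # (action, source); the source "any" matches everything


def first_match(rules: List[Rule], source: str) -> str:
    """Return the action of the first rule that matches the source."""
    for action, match in rules:
        if match in (source, "any"):
            return action
    return "deny"  # assumed implicit default when nothing matches


def remove_first_duplicate(rules: List[Rule]) -> List[Rule]:
    """Mimic the clean-up: when a rule appears twice, drop its first occurrence."""
    result = list(rules)
    for rule in rules:
        if result.count(rule) > 1:
            result.remove(rule)  # list.remove deletes the first matching entry
    return result


before = [("allow", "tenant-net"), ("deny", "any"), ("allow", "tenant-net")]
after = remove_first_duplicate(before)

print(first_match(before, "tenant-net"))  # duplicate is ignored, traffic passes
print(first_match(after, "tenant-net"))   # deny-all now matches first
```

With the duplicate present but untouched, the first allow rule still wins; once the first occurrence is removed, the deny-all rule is reached before the remaining allow rule.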
16.08.2018: LS Region
2018-08-16 13:30 After rebooting the affected switch, network connectivity has been restored and all systems are operational again.
2018-08-16 12:40: We have identified a crashed switch as the most probable cause for the outage. We are now working on rebooting it.
2018-08-16 11:30: The network connectivity to our datacenter in Lausanne is broken. All VMs in Lausanne are not reachable, the VMs in Zürich are still running. But our Identity Server (Keystone) is running in Lausanne, therefore no operations (start/stop) are possible.
08.02.2018: ZH Region
Around 16:30 our monitoring systems noticed that we have lost parts of our network connectivity to running VMs. Our engineers are looking into the problem.
17:00. One of our hardware border switches had a software failure which left it in an undefined/weird state. The redundant backup switch didn't deem it necessary to take over operations. We have dispatched engineers to the datacenter to reboot the switch.
17:45 Our engineers are rebooting the switches. Connectivity is slowly being restored.
18:30 Due to the network outage, our Ceph (Storage) cluster needs to be restarted. This takes a bit of time until we have restarted all service processes for each disk.
18:45 The Ceph cluster is back online - all services and VMs are running normally again.
We will look into the root cause of the incident and see what we can do to mitigate against these problems in the future. Thanks for your patience and sorry for the disruption.
01.09.2017: ZH region:
On September 1st 2017 we were testing the upgrade of our control plane bare metal servers from Trusty to Xenial. Due to an error in the access procedure to the Serial Console of the bare metal servers, we gave some package upgrade commands to a production server. This led to the upgrade of the Openvswitch package, which was not correctly configured after the upgrade.
At 11:06 Puppet ran on the production bare metal server while the upgrade was in progress and destroyed the configuration. The fact that the report of the Puppet run reached our Foreman server shows that IPv6 connectivity was still working. However, IPv4 was broken and Puppet failed various actions because of that.
At 11:30 we see from the network monitoring that the volume of traffic decreases sharply on the production server hit by this incident. At 11:34 our Nagios server informs us that Pet VMs are offline.
We start the troubleshooting and we notice that the server is up, but the network configuration is incorrect. Looking at the logs we find the root access from the console at 10:53, and we understand that the login was done on the wrong server.
We finished the upgrade procedure and fixed the configuration. Our team finished working on the incident at 12:38.
The problem affected some VMs of the control plane, and 2 out of 3 network nodes, starting at 11:30, to 12:15, for a total incident time of 45 minutes.
13.06.2017: ZH region: due to a network problem that started at 11:00, the storage cluster became slower than usual, and some cloud instances were reported to be stuck/blocked for some time.
Everything returned to normal at ~12:00.