Chronological List of outages
There is a post-mortem document covering all three outages on 2019-06-12.
2019-06-12 11:00 - 2019-06-13 19:30 - Problems with S3 storage in LS
A software problem with the underlying Ceph storage makes access to certain buckets hosted on our RadosGW (S3 compatible storage) impossible. We are working with engineers from Red Hat on diagnosing and fixing the problem. 31 buckets are currently inaccessible.
The problem has been fixed and service is restored. No data loss occurred.
2019-06-12 11:15 - 2019-06-12 13:00 - Outage on control plane ZH
From 11:15, an issue affecting control plane communication in SWITCHengines ZH was identified. The cause was attributed to networking problems that disrupted RabbitMQ. This was an unexpected side effect of an MTU issue during scheduled maintenance. It was fixed by relieving pressure on the message queue and restarting Neutron, the networking service. The issue was stabilised and service restored at 13:00.
2019-06-12 13:30 - 2019-06-12 13:57 - Partial outage of network connectivity in ZH
A crashing network node (which carries traffic from VMs to the internet) took down connectivity for all virtual routers hosted on that network node. SWITCHengines ZH has 3 network nodes, so roughly a third of the VMs were affected. We migrated the virtual routers to the other network nodes, which restored connectivity.
2018-11-26 15:20 - 15:47 - Loss of network connectivity in ZH to some VMs
Due to a crashed process on one of our infrastructure networking nodes, a portion of the VMs running in the ZH region lost their inbound and outbound network connectivity.
After a reboot of the affected component, network access was restored.
At 15:20:55 the software on one of our three network nodes crashed. This led to around 150 virtual routers dying. All VMs attached to these routers lost connectivity. A regular restart of the software did not solve the problem, so we rebooted the component at 15:38. After the software restarts, it has to rebuild the virtual network infrastructure. This process takes a couple of minutes and connectivity was fully restored at 15:47.
2018-10-08 11:00–2018-10-28 00:55: ZH/LS: Increased latency on virtual disks attached to VMs
Starting on 2018-10-10, several customers complained about extreme latency problems, leading to blocked VMs and services, on both the LS and the ZH Ceph clusters. We didn’t see those problems ourselves, but received emails from our customers in various states of agitation.
Initially, we treated the individual tickets as isolated incidents. In some cases (gitlab.customer1.ch), we rebooted customer VMs to resolve the blockage, which would reappear one or a few days later. Another customer (customer2) shut down their machine for weeks because they found that their application—Veeam backups to a VM—had effectively stopped working.
Attempts to address the issue based on a sketchy picture of the problems
At this stage, we suspected that the latency issues were related to ongoing work on the Ceph clusters, in particular the upgrade of storage servers (OSDs) from the older “FileStore” to the newer “BlueStore” format. Our approach is to upgrade individual servers at a rate of about one per day in each cluster. The process of filling an upgraded server’s disks with data causes a large amount of traffic within the cluster, and we thought that this load might be the reason for the high latency perceived by some of our users. We tried to address that by slowing down the migration process. Eventually we stopped the upgrade process altogether.
But the problems continued. We couldn’t even assess the extent of the issue: only a few users complained, mostly those with VMs that have one or several RBD (virtual disk) volumes of a terabyte or more.
Start of intense debugging work
On 19 October, a crisis was declared, and the team as a whole was tasked with (among other things) identifying the cause or causes of the ongoing problems in the clusters, and ideally solving them. In addition, we were supposed to develop a performance indicator that would let us notice such issues before our customers call us.
The team first worked on latency in general and entered into discussions about what latency should be expected. In particular, the only latency indicator that we do have—the “Cluster I/O Latency” graphs in the Ceph overview dashboards in Grafana—seemed to show a gradual increase of about 30% over the past few months. This turned out to have nothing to do with the problem that caused our users grief, but we spent a lot of time and effort interpreting and trying to explain these measurements. A beneficial outcome is that we now better understand what these graphs measure, and what they fail to see.
Towards reproducibility of the problem
The gitlab.customer1.ch hangs provided us with opportunities to debug the issues, because that was the only machine that we could access and that exhibited the problems. The difficulty was that it also provides an important service for CUSTOMER1, and at first we had no opportunity to do any diagnosis, because we had to restore service (by rebooting the VM), and that made the problem go away for the next day or couple of days. Later we moved the backup storage into a separate filesystem on a separate RBD disk. This gave us some time for diagnosing the issues when they were isolated to writing or reading backups.
On 25 October, we wrote a Python script in an attempt to reproduce the problem ourselves. The initial hypothesis was that the VMs were running into some limitation when large RBD disks are used, and the Ceph client in the VM monitor (QEMU) has to create many threads and many TCP sockets to OSDs. The script tried to quickly provoke such a situation by writing short strings to large (first 1TB, then two 2TB) RBD volumes via raw device (e.g. /dev/vdb) access. Between the short writes, the script skipped over 4MiB blocks to touch as many RADOS objects as possible (each RBD volume/image is split into many 4MiB objects).
The script failed to reproduce the problem, because the hypothesis was wrong. We then increased the size of the writes-per-4MiB-page from a few bytes, first to 64KiB, and then to the full 4MiB, at which point the script continuously wrote over the entire volume. With that setting, the script would hit the issue: after some time (often many minutes or even hours), a 4MiB write request would suddenly take 27 minutes instead of the usual fraction of a second. This was the first time we could reproduce the problem with a “synthetic” workload without harming any users.
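The original script is not reproduced here. As a rough sketch of the approach described above, the following Python function writes a block at regular offsets and times each write. The function name, the parameters and the one-second threshold are our own choices for illustration. Setting write_size to a few bytes gives the first variant, and setting it to the full 4MiB stride gives the variant that eventually triggered the problem.

```python
import os
import time

OBJECT_SIZE = 4 * 1024 * 1024  # each RBD image is split into 4 MiB objects


def timed_strided_writes(path, volume_size, write_size=OBJECT_SIZE,
                         stride=OBJECT_SIZE, slow_threshold=1.0):
    """Write write_size bytes at every stride offset and time each write.

    Returns a list of (offset, seconds) tuples, one per write, and prints
    every write that took longer than slow_threshold seconds.
    WARNING: this overwrites whatever is stored at `path`.
    """
    payload = memoryview(b"\xa5" * write_size)
    timings = []
    fd = os.open(path, os.O_WRONLY)
    try:
        for offset in range(0, volume_size - write_size + 1, stride):
            start = time.monotonic()
            written = 0
            while written < write_size:  # pwrite may write less than asked
                written += os.pwrite(fd, payload[written:], offset + written)
            os.fsync(fd)  # make sure the data leaves the guest page cache
            elapsed = time.monotonic() - start
            timings.append((offset, elapsed))
            if elapsed > slow_threshold:
                print(f"slow write at offset {offset}: {elapsed:.1f}s")
    finally:
        os.close(fd)
    return timings
```

Against a scratch volume this could be called as timed_strided_writes("/dev/vdb", 2 * 1024**4). It destroys the data on the device, so it should only ever be pointed at a volume created for the test.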
TCP bug in a version of our Linux kernel
On the next day, we were able to analyze these situations in more detail by tracing the traffic of the VM (in a “slow volume” state). After lengthy debugging sessions, we found that the TCP traffic from the writing VM—the sender—to an OSD—the receiver—was limited to one 512-byte TCP segment every 200 milliseconds. We looked for possible reasons for the slow progress, but couldn’t find any; both sender and receiver had ample buffer space, and we couldn’t find any other bottleneck.
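The observed segment rate matches the 27-minute stalls seen with the test script. A back-of-the-envelope check, assuming the whole 4MiB request travels over one stalled connection and ignoring protocol overhead (the function name is ours):

```python
def stalled_write_minutes(request_bytes, segment_bytes=512, interval_s=0.2):
    """Minutes needed to send request_bytes at one segment per interval."""
    segments = -(-request_bytes // segment_bytes)  # ceiling division
    return segments * interval_s / 60


# 4 MiB is 8192 segments of 512 bytes; at 200 ms each that is 1638.4 s.
print(f"{stalled_write_minutes(4 * 1024 * 1024):.1f} minutes")  # 27.3 minutes
```

One 512-byte segment every 200 milliseconds is only 2560 bytes per second, which is why a single 4MiB write took about 27 minutes.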
This mystery was resolved on the next day (2018-10-27, a Saturday), when another team member found Ubuntu bug 1796895, which fit that symptom quite well. We found that the buggy kernel version was present on 9 storage servers and 1 compute node. We quickly upgraded the kernels on those servers, and the problem hasn’t occurred since then.
Provenance of the kernel TCP bug
We wanted to know more about the kernel bug, and found that it was never present in the “upstream” Linux kernel. It was introduced in the LTS Linux kernel when two related patches—submitted to the upstream kernel maintainers in the space of a second—were separately backported from upstream to stable, with several weeks between them. The (buggy) version of the LTS kernel was available for those several weeks, and it was picked up by the Ubuntu kernel maintainers later, and published as official Ubuntu Linux kernels.
The separation of these patches by the LTS kernel maintainers is what created the bug. It seems promising to work on improvements in the Linux kernel maintenance process to reduce the possibility of this type of problem occurring in the future. We intend to bring this bug, and its annoying consequences for us, to the attention of the kernel maintainer community in a friendly manner to encourage reflection on such process improvements.
The proper debugging of issues that are affecting our users’ ability to work should be given priority over everything else. Effort spent on the thorough debugging of such issues should be encouraged and rewarded rather than seen as a distraction from planned work.
When users notify us about performance problems, we must expend effort to truly understand their observations. We cannot expect users to characterize issues in our terms, so we must ask directed questions so that we can get a good picture. Unless/until the user has confirmed that the problem has gone away, we should assume that it still exists. If we cannot reproduce the symptoms ourselves, we should ask the user for access to the affected systems—they will usually be happy to grant it if they believe that we’ll make efforts to solve their issue.
Performance debugging is hard. We need to improve our analytical team skills as well as our understanding of the overall system, so that we can reason effectively about possible causes.
In particular, we need to get better at testing hypotheses using reflection/discussion and experiments. This should reduce the time and effort expended on side-tracks with limited impact on user-perceived performance.
Actions to take
We took adaptive actions after the problem was identified. In addition, these further actions remain to be taken:
- Make upstream Linux developers aware of the problem and how it could have been prevented (combining the two separate merge requests, commenting that they should be applied together)
- Don’t upgrade components (in this case the Ceph BlueStore migration) in two locations at the same time
- Only install kernels that have been installed and vetted in the staging environment
- Train the engineering team on deductive reasoning in complex systems
- Review metrics / dashboards to see if they measure what they say they measure
06.09.2018, 13:06–13:23: ZH Region: Multiple interruptions of connectivity to many VMs
Between 13:06:07 and 13:23:12 (MET DST), connectivity was interrupted for multiple short periods of time (seconds to minutes) for SWITCHengines VMs on some, but not all, tenant networks in the ZH region. The longest continuous interruption was between 13:06:07 and 13:11:23.
Reason for outage: We had to modify some access lists on some hosts as part of the process of extending the infrastructure with new servers. In the process, we created a duplicate clause within the access list. In principle this would have been harmless, because the second clause would simply have been ignored. But to our surprise, a daemon that is part of the software-defined networking part of SWITCHengines (OpenStack Neutron) detected the duplicate rule and "helpfully" deleted the first occurrence. That caused legitimate traffic to be dropped by a subsequent deny-all rule; in particular, all external traffic for tenant networks whose router happened to be running on that particular host (about a third of the tenant networks in the ZH region). At first we re-inserted the deleted rule manually a few times before understanding the underlying issue. Since we deleted the second occurrence at 13:23, connectivity has been stable again. We apologize for the inconvenience.
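The failure mode can be illustrated with a toy model of a first-match access list. This is not the real access list and not Neutron code. The rule names are made up, and the sketch assumes that the duplicate clause ended up after the deny-all rule, which the description above implies but does not state.

```python
def first_match(rules, packet):
    """Return the action of the first rule that matches the packet."""
    for match, action in rules:
        if match == "any" or match == packet:
            return action
    return "deny"


def drop_first_duplicate(rules):
    """Mimic a cleanup that deletes the earlier copy of a duplicated rule."""
    cleaned = list(rules)
    for i, rule in enumerate(cleaned):
        if rule in cleaned[i + 1:]:
            del cleaned[i]
            break
    return cleaned


# Hypothetical list: the allow rule, the deny-all rule, and the duplicate
# allow rule that was added by mistake (assumed to sit after deny-all).
acl = [("tenant-net", "allow"), ("any", "deny"), ("tenant-net", "allow")]

print(first_match(acl, "tenant-net"))                        # allow
print(first_match(drop_first_duplicate(acl), "tenant-net"))  # deny
print(first_match(acl[:2], "tenant-net"))                    # allow
```

With the duplicate in place the first copy still matches, so traffic passes. Once the first copy is removed, the deny-all rule matches before the remaining copy. Removing the second copy instead, as we did at 13:23, leaves the list in its intended form.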
16.08.2018: LS Region
2018-08-16 13:30: After rebooting the affected switch, network connectivity has been restored and all systems are operational again.
2018-08-16 12:40: We have identified a crashed switch as the most probable cause for the outage. We are now working on rebooting it.
2018-08-16 11:30: The network connectivity to our datacenter in Lausanne is broken. None of the VMs in Lausanne are reachable; the VMs in Zürich are still running. However, our identity server (Keystone) runs in Lausanne, so no operations (start/stop) are possible.
08.02.2018: ZH Region
Around 16:30 our monitoring systems noticed that we have lost parts of our network connectivity to running VMs. Our engineers are looking into the problem.
17:00. One of our hardware border switches had a software failure which left it in an undefined/weird state. The redundant backup switch didn't deem it necessary to take over operations. We have dispatched engineers to the datacenter to reboot the switch.
17:45 Our engineers are rebooting the switches. Connectivity is slowly being restored.
18:30 Due to the network outage, our Ceph (Storage) cluster needs to be restarted. This takes a bit of time until we have restarted all service processes for each disk.
18:45 The Ceph cluster is back online - all services and VMs are running normally again.
We will look into the root cause of the incident and see what we can do to mitigate such problems in the future. Thanks for your patience and sorry for the disruption.
01.09.2017: ZH region:
On September 1st 2017 we were testing the upgrade of our control plane bare metal servers from Trusty to Xenial. Due to an error in the access procedure to the serial console of the bare metal servers, we issued some package upgrade commands on a production server. This led to an upgrade of the Open vSwitch package, which was not correctly configured after the upgrade.
At 11:06 Puppet ran on the production bare metal server while the upgrade was in progress and destroyed the configuration. The fact that the report of the Puppet run reached our Foreman server shows that IPv6 connectivity was still working. However, IPv4 was broken and Puppet failed various actions because of that.
At 11:30 our network monitoring showed a sharp drop in traffic volume on the production server hit by this incident. At 11:34 our Nagios server informed us that Pet VMs were offline.
We started troubleshooting and noticed that the server was up, but its network configuration was incorrect. Looking at the logs, we found the root access from the console at 10:53 and understood that the login had been made on the wrong server.
We finished the upgrade procedure and fixed the configuration. Our team finished working on the incident at 12:38.
The problem affected some VMs of the control plane and 2 out of 3 network nodes from 11:30 to 12:15, for a total incident time of 45 minutes.
13.06.2017: ZH region: Due to a network problem that started at 11:00, the storage cluster became slower than usual, and some cloud instances were reported to be stuck/blocked for some time.
Everything returned to normal at about 12:00.