
Aug 8, 2019

IANVS: Postmortem for outage on Aug 5th–Aug 6th

Written by Patrice Peterson

Dear IANVS users,

On Monday morning, the IANVS cluster went down due to a major filesystem outage. This is an explanation of what happened and what we’ve learned.

What happened?

At 2:14, a number of kernel threads deadlocked on our filesystem servers, which resulted in recoverable filesystem unmounts on several nodes. Deadlocks are an accepted part of life with any sufficiently complex parallel program; GPFS generally handles them fairly well and also has a diagnostic for identifying them ("mmdiag --deadlocks"), but we have never been able to use this tool to our satisfaction.
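
For reference, such a check can be run across all fileservers from a management node. The following is a minimal sketch in Python, assuming passwordless SSH to gpfs01 through gpfs08; it simply reuses the "mmdiag --deadlocks" invocation quoted above, and the exact option name may vary between GPFS releases:

#!/usr/bin/env python3
# Minimal sketch: run the GPFS deadlock diagnostic quoted above on every
# fileserver and print whatever it reports. Assumes passwordless SSH from a
# management node to gpfs01..gpfs08; the exact mmdiag option name may vary
# between GPFS releases.
import subprocess

FILESERVERS = [f"gpfs{i:02d}" for i in range(1, 9)]  # gpfs01 .. gpfs08

def deadlock_report(host: str) -> str:
    """Return the raw mmdiag output from one fileserver (or the SSH error)."""
    result = subprocess.run(
        ["ssh", host, "mmdiag --deadlocks"],
        capture_output=True, text=True, timeout=30,
    )
    return result.stdout if result.returncode == 0 else result.stderr

if __name__ == "__main__":
    for host in FILESERVERS:
        print(f"=== {host} ===")
        print(deadlock_report(host).strip() or "(no output)")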

At the same time, we also received the first "quorum cannot be reached" messages. After a few minutes, this resulted in a mass unmount of /home and /scratch from all nodes. Despite first appearances, this is actually a good thing: If the filesystem were to continue operating without an established quorum between all fileservers, there would be no agreement on which data should be written to disk, and the next reads from the filesystem could return corrupted data. Incidentally, this is the reason why users who were already logged into the cluster were seeing the "stale file handle" messages: Their session still had the files open, but those "open files" were not backed by actual files on disk anymore.
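
To make the "stale file handle" symptom a bit more concrete, here is a made-up Python sketch (with a hypothetical path) of what a user's still-running program effectively experienced: the file descriptor opened before the outage stays open, but any further read fails with the ESTALE error once the mount behind it is gone:

#!/usr/bin/env python3
# Hypothetical illustration of the "stale file handle" symptom: the file was
# opened before the outage, but once /home was unmounted the open descriptor
# no longer refers to a real file, so further reads fail with errno ESTALE.
import errno

LOGFILE = "/home/someuser/job-output.log"  # made-up path on the GPFS /home

def tail_until_stale(path: str) -> None:
    with open(path, "rb") as handle:
        while True:
            try:
                chunk = handle.read(4096)
            except OSError as exc:
                if exc.errno == errno.ESTALE:
                    print("stale file handle: the filesystem behind this file is gone")
                    return
                raise  # any other I/O error is not our concern here
            if not chunk:
                break  # normal end of file

if __name__ == "__main__":
    tail_until_stale(LOGFILE)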

At 2:15, GPFS reported failed disks from the fileservers gpfs07 and gpfs08. (A later investigation revealed that this was a red herring: Our disks are backed by hardware RAID6 sets, which can tolerate up to two disk failures per set. When we went to look into the server room, none of our disk systems showed any problems. We cross-checked with software utilities, but those did not show any errors either.)

At 2:20, the two GPFS servers with failing disks, gpfs07 and gpfs08, rebooted without any indication as to why, resulting in a clear quorum loss. The rest of the GPFS nodes signalled to the compute, login, and management nodes that they should unmount the /home and /scratch filesystems, which they did.

At 10:50, the administrator was made aware of the issue. He notified the users and started investigating. No disks were faulty despite GPFS’s claims, so he tried to restart them and replay the filesystem journal onto them. However, this didn’t work, as GPFS was not able to find the disks.

At 13:00, the cluster vendor joined the investigation. They found a mismatch between the system clocks of the GPFS servers, which can be a major problem in a distributed system like GPFS, since timers are used extensively to establish fault tolerance. The clocks of the offending servers (gpfs07 and gpfs08) were off by as much as 17 minutes:

gpfs01: Mon Aug  5 17:35:04 CEST 2019
gpfs02: Mon Aug  5 17:35:44 CEST 2019
gpfs03: Mon Aug  5 17:33:27 CEST 2019
gpfs04: Mon Aug  5 17:24:04 CEST 2019
gpfs05: Mon Aug  5 17:31:24 CEST 2019
gpfs06: Mon Aug  5 17:32:18 CEST 2019
gpfs07: Mon Aug  5 17:21:30 CEST 2019
gpfs08: Mon Aug  5 17:18:20 CEST 2019
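
Such a comparison is easy to reproduce from a management node. Below is a minimal Python sketch, assuming passwordless SSH to the fileservers; it reads each node's clock as a Unix timestamp and prints its offset from the local clock (the SSH round trip is negligible next to offsets of several minutes):

#!/usr/bin/env python3
# Minimal sketch: read the system clock of each GPFS fileserver over SSH and
# report its offset from the local clock. Assumes passwordless SSH; the SSH
# round-trip time is ignored, which is fine when the skew is minutes, not ms.
import subprocess
import time

FILESERVERS = [f"gpfs{i:02d}" for i in range(1, 9)]  # gpfs01 .. gpfs08

def remote_epoch(host: str) -> int:
    """Return the remote clock as seconds since the Unix epoch."""
    out = subprocess.run(
        ["ssh", host, "date +%s"],
        capture_output=True, text=True, timeout=30, check=True,
    ).stdout
    return int(out.strip())

if __name__ == "__main__":
    for host in FILESERVERS:
        offset = remote_epoch(host) - time.time()
        print(f"{host}: {offset:+9.0f} s relative to this node")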

Clock synchronization on Linux is handled via the Network Time Protocol (NTP), and the university runs an NTP server (mlutime.uni-halle.de) that serves as the synchronization source. However, even though the GPFS servers were configured to communicate with this NTP server, a network configuration error prevented them from reaching it. A workaround for this problem was put in place, and once the clocks were back in sync, the GPFS servers were again able to form a quorum.
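
A quick way to test from a given server whether the campus time source is reachable at all (and how far the local clock is off) is to query it directly. The sketch below uses the third-party ntplib package, purely as an independent reachability check alongside the NTP daemon's own status tools:

#!/usr/bin/env python3
# Minimal sketch: ask the campus NTP server directly for the time and print
# the local clock's offset. Uses the third-party "ntplib" package
# (pip install ntplib); this is an independent reachability check, not a
# replacement for the NTP daemon's own status tools.
import sys
import ntplib

NTP_SERVER = "mlutime.uni-halle.de"  # campus time source named above

def main() -> int:
    client = ntplib.NTPClient()
    try:
        response = client.request(NTP_SERVER, version=3, timeout=5)
    except Exception as exc:  # DNS, routing, or firewall problems end up here
        print(f"cannot reach {NTP_SERVER}: {exc}")
        return 2
    print(f"local clock offset to {NTP_SERVER}: {response.offset:+.3f} s")
    return 0 if abs(response.offset) < 1.0 else 1

if __name__ == "__main__":
    sys.exit(main())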

The GPFS servers were rebooted, and finally we were able to add the missing disks back into the filesystem.

At 18:00, the filesystem was in working order again. After we had performed some sanity checks and confirmed that there was no data loss (the underlying disks had been working fine all along), IANVS went back online on Aug 6th at 12:00.

What did we learn?

  • All our procedures and tools have so far been tailored towards scheduled maintenance outages. We will look into establishing and testing appropriate emergency procedures as well.
  • The latest (and sometimes the only working!) versions of some of our admin tools were scattered across the cluster in different locations. For example, the script that collects e-mail addresses for the HPC newsletter was located on… GPFS itself. Instead of using this version, the administrator had to rework an old copy of the script. We have now consolidated the admin tools in a location that remains available even when the filesystem is down.
  • Correct clocks are important. In a distributed system, correct clocks are paramount. We will therefore monitor the node clocks going forward.
  • The current routing workaround will be replaced with a cleaner, direct solution. Doing so will free up resources for other services critical to the cluster.
  • The deadlocks, as well as the red herring involving the "failed disks", are probably fixed in a sufficiently recent version of GPFS. We will look into timely GPFS updates in the future (while keeping in mind that updating the filesystem requires downtime).

Please contact us if you have any questions, either by e-mail or by phone: (0345) 55-21864 or (0345) 55-21861.

Best regards,
ITZ HPC team
