Dear IANVS users,
The cluster is back online.
Yesterday, on 2024-09-11 at 15:50, an error during repairs to the ITZ’s cooling infrastructure caused a short building-wide power outage on one of our two power lanes. The ITZ’s systems have redundant power supplies and were largely unaffected. Although IANVS also has redundant power supplies, the outage cut power to two of the cluster’s power supplies, one of the cooling racks, and several TrueScale switches. This took down the distributed file system, which in turn took down the resource scheduler and the job queue.
All systems except the failed power supplies are now back online. We are procuring replacement parts for the power supplies and have put a workaround in place in the interim.
We are working closely with department 4.4.2 to find ways to mitigate similar issues in the future.
If you have any questions, please contact us by e-mail or by phone at (0345) 55-21864.
Best regards,
ITZ HPC Team