
18. Nov 2017

IANVS Back Online

Written by Patrice Peterson

Dear IANVS users,

the cluster is back online; we apologize for the extended downtime.

In addition to the changes previously listed, the following have been implemented:

  • The maximum array index has been increased to 1,000,000. (The number of array jobs per user is still capped at 40,000.)
  • Jobs will now only be optimized for network topology if they include switch options (i.e. ib12 or ib18 constraints, or the --switches parameter). Topology-aware placement incurs high scheduler overhead and is unnecessary for jobs that do not take advantage of optimal network placement. It would also lead to increased resource fragmentation if applied when most job allocations do not require it.
  • Jobs can now be profiled via the --profile switch. This writes an HDF5 file with time-series data on resource usage to a subdirectory under /scratch/profile. See the SLURM documentation page on profiling for details.
  • We have enabled Simultaneous Multithreading (SMT) on all small, large, and special nodes. (The gpu nodes already had SMT enabled.) However, you will need to explicitly opt in to using both SMT threads on a core; by default, only one thread per core is used. You can find a small example submit script in our wiki. We will continue to monitor the effect this has on job throughput and system stability.
  • The DMTCP program has been installed and can be used for checkpointing jobs. (The previously mentioned BLCR program has been deprecated and will be removed in the next major version of SLURM.) However, not all jobs are eligible for checkpointing; most notably, GPU jobs cannot currently be checkpointed. We will add documentation regarding this feature in the very near future. For now, please refer to the Quickstart file.
  • The --mail-type=TIME_LIMIT, TIME_LIMIT_50, TIME_LIMIT_80, and TIME_LIMIT_90 options have been added to sbatch and srun. With these options, the scheduler sends an email when a job reaches 50%, 80%, 90%, or 100% of its time limit.
  • SLURM now includes a job state summary and CPU/memory efficiency analysis in its emails. An example:

    Subject: [ianvs] SLURM Job_id=2837963 Name=longjob1 Ended, Run time 00:03:09, TIMEOUT, ExitCode 0
    Job ID: 123456
    Cluster: ianvs
    User/Group: ianvs/hpc
    State: COMPLETED (exit code 0)
    Nodes: 1
    Cores per node: 2
    CPU Utilized: 00:00:00
    CPU Efficiency: 0.00% of 00:06:18 core-walltime
    Job Wall-clock time: 00:03:09
    Memory Utilized: 3.80 MB
    Memory Efficiency: 3.80% of 100.00 MB
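Several of the new submit options described above can be combined in a single batch script. The following is a minimal sketch, not an official template: the job name, resource requests, and program path are placeholders that you should adapt to your own workload.

```shell
#!/bin/bash
# Hypothetical submit script illustrating the new options.
#SBATCH --job-name=example
#SBATCH --time=01:00:00
#SBATCH --array=1-1000                     # array indices may now range up to 1,000,000
#SBATCH --cpus-per-task=2
#SBATCH --hint=multithread                 # opt in to using both SMT threads per core
#SBATCH --profile=task                     # write HDF5 usage data below /scratch/profile
#SBATCH --mail-type=TIME_LIMIT_80,TIME_LIMIT_90   # mail at 80% and 90% of the time limit

srun ./my_program "$SLURM_ARRAY_TASK_ID"
```

After the job finishes, the per-node HDF5 profile files can be merged into a single file with SLURM's sh5util tool, e.g. sh5util -j <jobid>.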
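Conversely, jobs that do benefit from topology-aware placement can request it explicitly via --switches. A sketch of the syntax (the switch count, wait time, and node count below are arbitrary examples, not recommendations):

```shell
# Request that all allocated nodes sit under at most one leaf switch,
# and wait up to 60 minutes for such an allocation to become available.
sbatch --switches=1@60 --nodes=4 job.sh
```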

Some additional smaller changes have been implemented:

  • SLURM scheduling performance has been improved. Bursts of several thousand (non-array) job submissions should no longer pose a problem, though we still ask you to use array jobs where appropriate.
  • Resource fragmentation should decrease because cores are no longer re-allocated immediately after a job completes.
  • Nodes can now batch messages, taking load off the scheduler and leaving more usable resources on the login nodes.
  • Job steps using srun should start faster.
  • The test reservation (which is administrator-exclusive) will now be active Monday through Friday, 6:30–20:30. Outside these hours, i.e. overnight and on weekends, the reservation's resources can be used by regular jobs.
  • We have upgraded to CentOS 7.4 and SLURM 17.11.

Several things are planned in the near future:

  • Documentation for the newly implemented features is currently lacking; we will improve it over the coming days and weeks.
  • We will look into the MPI scaling problem.
  • We will introduce research group folders, which can be accessed by all members of a research group. This feature is currently in the prototyping phase and will be rolled out gradually.
  • SLURM 17.11 has introduced a generalized "billing" resource. Once we find acceptable parameters for this generalized resource, we aim to significantly relax resource limits, for example the current 400-job limit per user.

Please contact us by e-mail or by phone at (0345) 55-21864 or (0345) 55-21861 if you have any questions.

Best regards,
ITZ HPC team
