Predicting accurate batch queue wait times on production supercomputers by combining machine learning techniques

Research output: Contribution to journal › Article › peer-review

Abstract / Description of output

The ability to accurately predict when a job on a supercomputer will leave the queue and start to run is not only beneficial for providing insights to users, but can also help enable non-traditional HPC workloads that are not necessarily suited to the batch-queue-style approach that is ubiquitous on production HPC machines. However, there are numerous challenges in achieving such a prediction with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behaviour resulting from the queue policy and other interactions to generate accurate job start times.

For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600) and 4-cabinet (HPE Cray EX) we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimate generated by Slurm. By combining categorisation and regression models we demonstrate that our approach delivers the most accurate predictions across our machines of interest, with the result of this work being the ability to predict job start times within one minute of the actual start time for around 65% of jobs on ARCHER2 and 4-cabinet, and 76% of jobs on Cirrus. When compared against what Slurm can deliver, via the backfill plugin, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better on Cirrus. Furthermore, our approach can accurately predict the start time for three quarters of all jobs within ten minutes of the actual start time on ARCHER2 and 4-cabinet, and for 90% of jobs on Cirrus. Whilst the initial driver of this work was to better facilitate non-traditional (interactive and urgent) workloads on HPC machines, the insights gained can also be used to provide wider benefits to users, enrich existing batch queue systems, and inform supercomputing centre policy.
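
The accuracy figures quoted above are shares of jobs whose predicted start time falls within a tolerance (one minute or ten minutes) of the actual start time. The following is a minimal sketch of how such a share could be computed. It assumes the tolerance is applied to the absolute difference, which the abstract does not state explicitly. The function name `fraction_within` and the sample wait times are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def fraction_within(predicted: Sequence[float],
                    actual: Sequence[float],
                    tolerance_s: float) -> float:
    """Return the share of jobs whose predicted wait time lies within
    tolerance_s seconds of the actual wait time (absolute difference)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one job is required")
    hits = sum(1 for p, a in zip(predicted, actual)
               if abs(p - a) <= tolerance_s)
    return hits / len(predicted)


# Hypothetical wait times in seconds, invented purely for illustration.
predicted_wait = [30.0, 600.0, 45.0, 3600.0]
actual_wait = [60.0, 900.0, 50.0, 7200.0]

within_one_minute = fraction_within(predicted_wait, actual_wait, 60.0)
within_ten_minutes = fraction_within(predicted_wait, actual_wait, 600.0)
```

Applying the same function to a baseline, such as the estimate produced by Slurm, would give a like-for-like comparison at each tolerance.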
Original language: English
Article number: e8112
Journal: Concurrency and Computation: Practice and Experience
Volume: 36
Issue number: 15
Publication status: Published - 10 Jul 2024

Keywords

  • boosted trees
  • classification
  • HPC
  • machine learning
  • queue wait time prediction
  • regression
