Automated service monitoring in the deployment of ARCHER2

Kieran Leach, Philip Cass, Steven Robson, Eimantas Kazakevicius, Martin Lafferty, Andrew Turner, Alan D Simpson

Research output: Contribution to conference, Paper, peer-review

Abstract / Description of output

The ARCHER2 service, a CPU-based HPE Cray EX system with 750,080 cores (5,860 nodes), was deployed over the course of 2020 and 2021 and entered full service in December 2021. A key part of the work during this deployment was the integration of ARCHER2 into our local monitoring systems. As ARCHER2 was one of the very first large-scale EX deployments, this involved close collaboration and development work with the HPE team during a global pandemic, when collaboration and co-working were significantly more challenging than usual. The deployment included the creation of automated checks and visual representations of system status, which needed to be made available to external parties for diagnosis and interpretation. We describe how these checks were deployed and how the data gathered played a key role in the deployment of ARCHER2, the commissioning of the plant infrastructure, the conduct of HPL runs for submission to the Top500, and contractual monitoring of the availability of the ARCHER2 service during its commissioning and early life.
Original language: English
Number of pages: 7
Publication status: Published - 12 May 2022
Event: Cray User Group - Monterey, United States
Duration: 1 May 2022 to 5 May 2022
Conference number: 2022
https://cug.org/cug-2022/

Conference

Conference: Cray User Group
Abbreviated title: CUG
Country/Territory: United States
City: Monterey
Period: 1/05/22 to 5/05/22
