TY - JOUR
T1 - Metrics reloaded: recommendations for image analysis validation
AU - Maier-Hein, Lena
AU - Reinke, Annika
AU - Godau, Patrick
AU - Tizabi, Minu D.
AU - Buettner, Florian
AU - Christodoulou, Evangelia
AU - Glocker, Ben
AU - Isensee, Fabian
AU - Kleesiek, Jens
AU - Kozubek, Michal
AU - Reyes, Mauricio
AU - Riegler, Michael A.
AU - Wiesenfarth, Manuel
AU - Kavur, A. Emre
AU - Sudre, Carole H.
AU - Baumgartner, Michael
AU - Eisenmann, Matthias
AU - Heckmann-Nötzel, Doreen
AU - Rädsch, Tim
AU - Acion, Laura
AU - Antonelli, Michela
AU - Bakas, Spyridon
AU - Benis, Arriel
AU - Blaschko, Matthew B.
AU - Cardoso, M. Jorge
AU - Cheplygina, Veronika
AU - Cimini, Beth A.
AU - Collins, Gary S.
AU - Farahani, Keyvan
AU - Ferrer, Luciana
AU - Galdran, Adrian
AU - van Ginneken, Bram
AU - Haase, Robert
AU - Hashimoto, Daniel A.
AU - Hoffman, Michael M.
AU - Huisman, Merel
AU - Jannin, Pierre
AU - Kahn, Charles E.
AU - Kainmueller, Dagmar
AU - Kainz, Bernhard
AU - Karargyris, Alexandros
AU - Karthikesalingam, Alan
AU - Kofler, Florian
AU - Kopp-Schneider, Annette
AU - Kreshuk, Anna
AU - Kurc, Tahsin
AU - Landman, Bennett A.
AU - Litjens, Geert
AU - Madani, Amin
AU - Maier-Hein, Klaus
AU - Martel, Anne L.
AU - Mattson, Peter
AU - Meijering, Erik
AU - Menze, Bjoern
AU - Moons, Karel G. M.
AU - Müller, Henning
AU - Nichyporuk, Brennan
AU - Nickel, Felix
AU - Petersen, Jens
AU - Rajpoot, Nasir
AU - Rieke, Nicola
AU - Saez-Rodriguez, Julio
AU - Sánchez, Clara I.
AU - Shetty, Shravya
AU - van Smeden, Maarten
AU - Summers, Ronald M.
AU - Taha, Abdel A.
AU - Tiulpin, Aleksei
AU - Tsaftaris, Sotirios A.
AU - Van Calster, Ben
AU - Varoquaux, Gaël
AU - Jäger, Paul F.
N1 - Funding Information:
This work was initiated by the Helmholtz Association of German Research Centers in the scope of the Helmholtz Imaging Incubator (HI), the MICCAI Special Interest Group on biomedical image analysis challenges and the benchmarking working group of the MONAI initiative. It received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation program (grant agreement no. 101002198, NEURAL SPICING). It was further supported in part by the Intramural Research Program of the National Institutes of Health (NIH) Clinical Center as well as by the National Cancer Institute (NCI) and the National Institute of Neurological Disorders and Stroke (NINDS) of the NIH, under award numbers NCI:U01CA242871, NCI:U24CA279629 and NINDS:R01NS042645. The content of this publication is solely the responsibility of the authors and does not represent the official views of the NIH. T.A. acknowledges the Canada Institute for Advanced Research (CIFAR) AI Chairs program, the Natural Sciences and Engineering Research Council of Canada. F.B. was co-funded by the European Union (ERC, TAIPO, 101088594). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the ERC. Neither the European Union nor the granting authority can be held responsible for them. V.C. acknowledges funding from Novo Nordisk Foundation (NNF21OC0068816) and Independent Research Council Denmark (1134-00017B). B.A.C. was supported by NIH grant P41 GM135019 and grant 2020-225720 from the Chan Zuckerberg Initiative DAF, an advised fund of the Silicon Valley Community Foundation. G.S.C. was supported by Cancer Research UK (program grant no. C49297/A27294). M.M.H. is supported by the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-05134). A. Karargyris is supported by French State Funds managed by the ‘Agence Nationale de la Recherche (ANR)’ - ‘Investissements d’Avenir’ (Investments for the Future), grant ANR-10-IAHU-02 (IHU Strasbourg). M.K. was supported by the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018129). T.K. was supported in part by 4UH3-CA225021-03, 1U24CA180924-01A1, 3U24CA215109-02 and 1UG3-CA225-021-01 grants from the NIH. G.L. receives research funding from the Dutch Research Council, the Dutch Cancer Association, HealthHolland, the ERC, the European Union and the Innovative Medicine Initiative. C.H.S. is supported by an Alzheimer’s Society Junior Fellowship (AS-JF-17-011). M.R. is supported by Innosuisse (grant no. 31274.1) and Swiss National Science Foundation (grant no. 205320_212939). R.M.S. is supported by the Intramural Research Program of the NIH Clinical Center. A.T. acknowledges support from the Academy of Finland (Profi6 336449 funding program), University of Oulu strategic funding, Finnish Foundation for Cardiovascular Research, Wellbeing Services County of North Ostrobothnia (VTR project K62716) and the Terttu foundation. S.A.T. acknowledges the support of Canon Medical and the Royal Academy of Engineering and the Research Chairs and Senior Research Fellowships scheme (grant RCSRF1819\8\25). We thank N. Sautter, P. Vieten and T. Adler for proposing the name for the project. We thank P. Bankhead, F. Hamprecht, H. Kenngott, D. Moher and B. Stieltjes for fruitful discussions on the framework. We thank S. Steger for the data protection supervision and A. Trotter for the hosting of the surveys. We thank L. Mais for instantiating the use case for InS of neurons from the fruit fly in 3D multicolor light microscopy images. We further thank the Janelia FlyLight Project Team for providing us with example images for this use case.
We thank the following people for testing the metric mappings, reviewing the recommendations and performing metric-centric testing: T. Adler, C. Bender, A. B. Qasim, K. Dreher, N. Holzwarth, M. Hübner, D. Michael, L.-R. Müller, M. Rees, T. Rix, M. Schellenberg, S. Seidlitz, J. Sellner, A. Srivastava, F. Wolf, A. E. Yamlahi, S. D. Almeida, M. Baumgartner, D. Bounias, T. Bungert, M. Fischer, L. Klein, G. Köhler, B. Kovács, C. Lueth, T. Norajitra, C. Ulrich, T. Wald, I. Alekseenko, X. Liu, A. Marheim Storås and V. Thambawita. We thank the following people for taking our social media community survey and providing helpful feedback for improving the framework: Y. Akemi, R. Anteby, C. Arthurs, P. De Backer, H. Badgery, M. Baugh, J. Bernal, D. Bounias, F. C. Kitamura, J. Carse, C. Chen, I. Flipse, N. Gaggion, C. González, P. M. Gordaliza, T. Horeman, L. Joskowicz, A. Jose, A. Kamath, B. Kelly, Y. Kirchhoff, L. A. Kobelke, L. Krämer, M. Krendel, J. LaMaster, T. de Lange, J. L. Lavanchy, J. Li, C. Lüth, L. Mais, A. Marheim Storås, V. Nath, C. Scannell, C. Pape, M. P. Schijven, A. Selvanetti, B. S. Fadida, R. Staff, J. Tan, E. Tkaczyk, R. T. Calumby, A. Vlontzos, W. Zhang, C. Zhao and J. Zhu.
Competing Interests:
We declare the following competing interests: Under terms of employment, M.B.B. is entitled to stock options in Mona.health, a KU Leuven spinoff. F.B. is an employee of Siemens AG. F.B. reports funding from Merck. B.v.G. is a shareholder of Thirona. B.G. was an employee of HeartFlow and Kheiron Medical Technologies. M.M.H. received an Nvidia GPU grant. B.K. is a consultant for ThinkSono. G.L. is on the advisory board of Canon Healthcare IT and is a shareholder of Aiosyn BV. N. Rieke is an employee of NVIDIA. J.S.-R. reports funding from GSK, Pfizer and Sanofi and fees from Travere Therapeutics, Stadapharm, Astex Therapeutics, Pfizer and Grunenthal. R.M.S. receives patent royalties from iCAD, ScanMed, Philips, Translation Holdings and PingAn; the laboratory of R.M.S. received research support from PingAn through a Cooperative Research and Development Agreement. S.A.T. receives financial support from Canon Medical Research Europe. The remaining authors declare no competing interests.
Publisher Copyright:
© Springer Nature America, Inc. 2024.
PY - 2024/2/12
Y1 - 2024/2/12
AB - Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. In biomedical image analysis, chosen performance metrics often do not reflect the domain interest, and thus fail to adequately measure scientific progress and hinder translation of ML techniques into practice. To overcome this, we created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Developed by a large international consortium in a multistage Delphi process, it is based on the novel concept of a problem fingerprint—a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), dataset and algorithm output. On the basis of the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as classification tasks at image, object or pixel level, namely image-level classification, object detection, semantic segmentation and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. Its applicability is demonstrated for various biomedical use cases.
KW - Algorithms
KW - Image Processing, Computer-Assisted
KW - Machine Learning
KW - Semantics
U2 - 10.1038/s41592-023-02151-z
DO - 10.1038/s41592-023-02151-z
M3 - Article
C2 - 38347141
SN - 1548-7105
VL - 21
SP - 195
EP - 212
JO - Nature Methods
JF - Nature Methods
IS - 2
ER -