Predicting the neutral hydrogen content of galaxies from optical data using machine learning

Mika Rafieferantsoa, Sambatra Andrianomena, Romeel Davé

Research output: Contribution to journal › Article › peer-review


We develop a machine learning-based framework to predict the HI content of galaxies from optical photometry and environmental parameters. We train the algorithm on z = 0-2 outputs from the MUFASA cosmological hydrodynamic simulation, which includes star formation, feedback, and a heuristic model to quench massive galaxies that yields a reasonable match to a range of survey data including HI. We employ a variety of machine learning methods (regressors), and quantify their performance using the slope of the predicted versus true relation, its root mean square error (RMSE), and Pearson correlation coefficient (r). Training on only Sloan Digital Sky Survey photometry, all regressors give r > 0.8 and RMSE ~ 0.3 at z = 0, led by random forests with r = 0.91, and a deep neural network (DNN) with comparable accuracy (r = 0.9). Adding near-IR photometry improves all regressors. All regressors perform worse with redshift, particularly at z ≲ 1. Slope values are generally sub-linear, so that we overpredict HI in HI-poor galaxies and underpredict it in HI-rich galaxies, because the regressors do not fully capture the scatter in the data. We test our framework on REsolved Spectroscopy Of a Local VolumE (RESOLVE) and Arecibo Legacy Fast ALFA (ALFALFA) survey data. Training on a subset of the observations, we find that our machine learning method can reasonably predict HI richnesses in the remaining data (RMSE ~ 0.28 for RESOLVE and ~0.25 for ALFALFA). Training on mock data from MUFASA to predict observed data performs worse (RMSE ~ 0.45 for RESOLVE and ~0.31 for ALFALFA), with the DNN clearly outperforming the other regressors. Our method will be useful for making galaxy-by-galaxy survey predictions and incompleteness corrections for upcoming HI 21 cm surveys on Square Kilometre Array precursors such as MeerKAT, over regions where photometry is already available.
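The three performance measures named in the abstract (slope of the predicted versus true relation, RMSE, and Pearson r) can be illustrated with a minimal sketch. This is not the authors' code: the function name `evaluate_predictions`, the choice of an ordinary least-squares fit of predicted on true values for the slope, and the toy numbers (imagined as log HI richness values) are all assumptions made for illustration.

```python
import math


def evaluate_predictions(true_vals, pred_vals):
    """Return (slope, rmse, pearson_r) for predicted versus true values.

    Hypothetical helper: the slope is taken from an ordinary least-squares
    fit of predicted on true values, which is an assumption, not a detail
    stated in the abstract.
    """
    n = len(true_vals)
    if n != len(pred_vals) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_t = sum(true_vals) / n
    mean_p = sum(pred_vals) / n

    # Sums of squared deviations and of cross-deviations about the means.
    s_tt = sum((t - mean_t) ** 2 for t in true_vals)
    s_pp = sum((p - mean_p) ** 2 for p in pred_vals)
    s_tp = sum((t - mean_t) * (p - mean_p) for t, p in zip(true_vals, pred_vals))
    if s_tt == 0 or s_pp == 0:
        raise ValueError("inputs must not be constant")

    slope = s_tp / s_tt  # least-squares slope of predicted on true
    rmse = math.sqrt(sum((p - t) ** 2 for t, p in zip(true_vals, pred_vals)) / n)
    pearson_r = s_tp / math.sqrt(s_tt * s_pp)
    return slope, rmse, pearson_r


# Toy data (invented): predictions compressed towards the mean, so the
# lowest value is overpredicted and the highest value is underpredicted.
true_log_hi = [-2.0, -1.0, 0.0, 1.0]
pred_log_hi = [-1.7, -0.9, -0.1, 0.7]

slope, rmse, pearson_r = evaluate_predictions(true_log_hi, pred_log_hi)
print(f"slope={slope:.2f}  RMSE={rmse:.3f}  r={pearson_r:.2f}")
```

In this toy case the slope is below one even though the correlation is perfect, which is the sense in which a sub-linear slope corresponds to overpredicting HI-poor and underpredicting HI-rich galaxies.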

Original language: English
Pages (from-to): 4509-4525
Number of pages: 17
Journal: Monthly Notices of the Royal Astronomical Society
Issue number: 4
Early online date: 5 Jul 2018
Publication status: Published - 1 Oct 2018


  • Galaxies: evolution
  • Galaxies: statistics
  • Methods: numerical

