In-Database Data Imputation

Massimo Perini, Milos Nikolic

Research output: Contribution to journal; Article; peer-review

Abstract

Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., mean), are computationally efficient but may introduce bias and disrupt variable relationships, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, limiting their applicability to small datasets. This work enables efficient, high-quality, and scalable data imputation within a database system using the widely used MICE method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in terms of computation time, while maintaining high imputation quality.
Original language: English
Article number: 70
Pages (from-to): 1-27
Number of pages: 27
Journal: Proceedings of the ACM on Management of Data
Volume: 2
Issue number: 1
Publication status: Published - 26 Mar 2024
Event: 2024 SIGMOD/PODS International Conference on Management of Data - Santiago, Chile
Duration: 9 Jun 2024 to 15 Jun 2024
https://2024.sigmod.org/calls_papers_important_dates.shtml

Keywords

  • MICE
  • factorized computation
  • incomplete data
  • missing data
  • ring
