Calculating Error Bars on Inferences from Web Data

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


In this work, we explore uncertainty in automated question answering over real-valued data from knowledge bases on the Internet. We argue that the coefficient of variation (cov) is an intuitive and general form in which to express this uncertainty, with the added advantage that it can be calculated exactly and efficiently. The large amount of data on the Internet presents a good opportunity to answer queries that go beyond simply looking up facts and returning them. However, such data is often vague and noisy. For discrete results, e.g., stating that a particular city is the capital of a particular country, probabilities are a natural way to assign uncertainty to answers. For continuous variables, or quantities that are typically treated as continuous (such as populations of countries), probabilities are uninformative, being infinitesimal. For instance, the probability that the population of India is exactly equal to the last census count is effectively zero. Our aim is to capture uncertainty in these estimates in an intuitive, uniform, and computationally efficient way. We present initial efforts at automating the inference process over real-valued web data while accounting for some of the typical sources of uncertainty: noisy data and errors from inference operations. Having considered several problem domains and query types, we opt to approximate all continuous random variables with Gaussian distributions and to communicate uncertainties to users as coefficients of variation. Our experiments show that the estimates of uncertainty derived by our method are well-calibrated and correlate with the actual deviations from the true answer. An immediate benefit of our approach is that our inference framework can attach credible intervals¹ to the real-valued answers that it infers. This conveys to a user the plausible magnitude of the error in the answer, a more meaningful measure of uncertainty than the ranking scores provided by other question answering systems.
¹ We will use symmetric 68.27 percent credible intervals for the remainder of this paper, corresponding to 1 standard deviation from the mean in a standardized Gaussian, but note that this contains sufficient information to estimate arbitrary posterior probabilities under our assumption of normality.
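As a rough illustration of the quantities named in the abstract, the sketch below computes a coefficient of variation, a symmetric 68.27 percent credible interval, and an arbitrary interval probability from a Gaussian mean and standard deviation. It is not the authors' implementation: the function names are hypothetical, the numbers are made-up placeholders, and the cov is assumed to be the standard deviation divided by the absolute mean.

```python
from statistics import NormalDist


def coefficient_of_variation(mean: float, std: float) -> float:
    """Standard deviation as a fraction of the absolute mean (assumed definition)."""
    if mean == 0:
        raise ValueError("cov is undefined for a zero mean")
    return std / abs(mean)


def credible_interval(mean: float, std: float, mass: float = 0.6827):
    """Symmetric interval holding `mass` probability under N(mean, std^2)."""
    # Quantile of the standard normal that leaves (1 - mass) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + mass / 2.0)
    return mean - z * std, mean + z * std


def prob_between(mean: float, std: float, lo: float, hi: float) -> float:
    """Posterior probability that the quantity lies in [lo, hi], assuming normality."""
    dist = NormalDist(mean, std)
    return dist.cdf(hi) - dist.cdf(lo)


# Hypothetical inferred answer: mean 1000 with standard deviation 50.
mean, std = 1000.0, 50.0
cov = coefficient_of_variation(mean, std)
low, high = credible_interval(mean, std)
two_sigma = prob_between(mean, std, 900.0, 1100.0)
print(f"cov = {cov:.3f}, 68.27% interval = ({low:.1f}, {high:.1f})")
print(f"P(900 <= x <= 1100) = {two_sigma:.4f}")
```

Because the mean and standard deviation fully determine a Gaussian, the last helper shows how the one-standard-deviation interval carries enough information to recover any other posterior probability, as the footnote remarks.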
Original language: English
Title of host publication: SAI Intelligent Systems Conference (IntelliSys)
Place of Publication: London, United Kingdom
Publisher: Springer, Cham
Number of pages: 23
ISBN (Electronic): 978-3-030-01057-7
ISBN (Print): 978-3-030-01056-0
Publication status: Published - 8 Nov 2018
Event: Intelligent Systems Conference (IntelliSys) 2018 - London, United Kingdom
Duration: 6 Sep 2018 → 7 Sep 2018

Publication series

Name: Advances in Intelligent Systems and Computing (AISC)
Publisher: Springer, Cham
ISSN (Print): 2194-5357
ISSN (Electronic): 2194-5365


Conference: Intelligent Systems Conference (IntelliSys) 2018
Abbreviated title: IntelliSys 2018
Country/Territory: United Kingdom


  • Query Answering
  • Error Bars
  • Uncertainty
  • Bayesian Inference
  • Coefficient of Variation

