Robust and Noise Resistant Wrapper Induction

Tim Furche, Jinsong Guo, Sebastian Maneth, Christian Schallhart

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Wrapper induction is the problem of automatically inferring a query from annotated web pages of the same template. This query should accurately select not only the annotated content but also other content following the same template. Beyond accurately matching the template, we consider two additional requirements: (1) wrappers should be robust against a large class of changes to the web pages, and (2) the induction process should be noise resistant, i.e., tolerate slightly erroneous (e.g., machine-generated) samples. Key to our approach is a query language that is powerful enough to permit accurate selection, but limited enough to force noisy samples to be generalized into wrappers that select the likely intended items. We introduce such a language as a subset of XPath and show that, even for such a restricted language, inducing optimal queries according to a suitable scoring is infeasible. Nevertheless, our wrapper induction framework infers highly robust and noise-resistant queries. We evaluate the queries on snapshots of web pages that change over time, as provided by the Internet Archive, and show that the induced queries are as robust as human-made queries. The queries often survive hundreds, sometimes thousands, of days, even with many changes to the relative position of the selected nodes (including changes at the template level). This robustness is due to the few, discriminative anchor (intermediately selected) nodes of the generated queries. The queries are highly resistant against positive noise (up to 50%) and negative noise (up to 20%).
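The robustness claim in the abstract can be illustrated with a toy sketch. It is not the paper's induction algorithm or its query language: the two page snapshots, the class names, and both queries are invented for illustration, and the queries are restricted to the limited XPath subset supported by Python's standard `xml.etree.ElementTree`. The sketch compares a purely positional query with one that relies on a discriminative anchor node when an unrelated element is added to the template.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page; the later one gains a banner.
OLD_PAGE = """<html><body>
<div class="nav">Home</div>
<div class="product"><span class="price">10</span></div>
</body></html>"""

NEW_PAGE = """<html><body>
<div class="banner">Sale</div>
<div class="nav">Home</div>
<div class="product"><span class="price">12</span></div>
</body></html>"""

# Positional query: depends on the target being inside the second <div>.
POSITIONAL = "./body/div[2]/span"
# Anchored query: depends only on a discriminative attribute of an anchor node.
ANCHORED = ".//div[@class='product']/span[@class='price']"


def select_text(page: str, query: str) -> list:
    """Return the text of every node the query selects on the page."""
    root = ET.fromstring(page)
    return [node.text for node in root.findall(query)]


if __name__ == "__main__":
    for name, query in (("positional", POSITIONAL), ("anchored", ANCHORED)):
        print(name, select_text(OLD_PAGE, query), select_text(NEW_PAGE, query))
```

On the old snapshot both queries select the price; after the banner is inserted, the positional query selects nothing while the anchored query still finds the price node.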
Original language: English
Title of host publication: SIGMOD '16 Proceedings of the 2016 International Conference on Management of Data
Publisher: ACM
Pages: 773-784
Number of pages: 13
ISBN (Electronic): 978-1-4503-3531-7
DOIs
Publication status: Published - 26 Jun 2016
Event: 2016 International Conference on Management of Data - San Francisco, United States
Duration: 26 Jun 2016 to 1 Jul 2016
http://sigmod2016.org/

Conference

Conference: 2016 International Conference on Management of Data
Abbreviated title: SIGMOD/PODS'16
Country/Territory: United States
City: San Francisco
Period: 26/06/16 to 1/07/16
Internet address
