Near Duplicate Text Detection Using Frequency-Biased Signatures

Yifang Sun, Jianbin Qin, Wei Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

As electronic documents become more popular, people want to find documents that are complete or partial duplicates of one another. In this paper, we propose a near-duplicate text detection framework that uses signatures to save space and query time. We also propose a novel signature selection algorithm that uses the collection frequency of q-grams. We compare our algorithm with Winnowing, one of the state-of-the-art signature selection algorithms, and show through extensive experiments that it achieves much better accuracy at lower time and space cost.
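
The abstract names the ingredients (q-grams, collection frequency, window-based signature selection as in Winnowing) but not the exact selection rule, so the following is only a minimal sketch under stated assumptions. The Winnowing key follows the standard scheme of keeping the minimum hash in each window of w consecutive q-grams. The frequency-biased key, which prefers the rarest q-gram in each window, is an assumption made for illustration and may differ from the authors' algorithm. All function names, the CRC32 hash, and the parameters q and w are hypothetical choices.

```python
import zlib
from collections import Counter


def qgrams(text, q):
    """Return the overlapping character q-grams of text, in order."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def collection_frequency(docs, q):
    """Count occurrences of each q-gram across the whole collection."""
    freq = Counter()
    for doc in docs:
        freq.update(qgrams(doc, q))
    return freq


def select_signatures(text, q, w, key):
    """Keep, for every window of w consecutive q-grams, the q-gram with the
    smallest key; ties go to the rightmost position, as in Winnowing.
    Returns a set of (position, q-gram) pairs; empty if the text has fewer
    than w q-grams."""
    grams = qgrams(text, q)
    picked = set()
    for start in range(len(grams) - w + 1):
        best = min(range(start, start + w), key=lambda i: (key(grams[i]), -i))
        picked.add((best, grams[best]))
    return picked


def winnowing_key(gram):
    """Standard Winnowing ranks q-grams by a hash value (CRC32 here)."""
    return zlib.crc32(gram.encode("utf-8"))


def frequency_key(freq):
    """ASSUMPTION: rank q-grams by collection frequency, rarest first."""
    return lambda gram: freq[gram]


def signature_overlap(sig_a, sig_b):
    """Jaccard similarity of the selected q-grams, ignoring positions."""
    a = {g for _, g in sig_a}
    b = {g for _, g in sig_b}
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy collection: the first two documents are near duplicates.
docs = ["the quick brown fox jumps", "the quick brown fox leaps", "lorem ipsum dolor sit"]
freq = collection_frequency(docs, q=3)
sigs = [select_signatures(d, q=3, w=4, key=frequency_key(freq)) for d in docs]
print(signature_overlap(sigs[0], sigs[1]), signature_overlap(sigs[0], sigs[2]))
```

Passing `winnowing_key` instead of `frequency_key(freq)` gives the hash-based baseline with the same window logic, so the two selection rules can be compared on the same collection. Because each window contributes one signature, the index stores only a fraction of all q-grams, which is where the space and query-time savings mentioned in the abstract would come from.
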
Original language: English
Title of host publication: Proceedings of the 14th International Conference on Web Information Systems Engineering – WISE 2013
Place of Publication: Nanjing, China
Publisher: Springer
Pages: 277-291
Number of pages: 15
ISBN (Electronic): 978-3-642-41230-1
ISBN (Print): 978-3-642-41229-5
Publication status: Published - 2013
Event: 14th International Conference on Web Information System Engineering - Nanjing, China
Duration: 13 Oct 2013 to 15 Oct 2013
http://wise2013.njue.edu.cn/

Conference

Conference: 14th International Conference on Web Information System Engineering
Abbreviated title: WISE 2013
Country/Territory: China
City: Nanjing
Period: 13/10/13 to 15/10/13
