The Impact of Near-Duplicate Documents on Information Retrieval Evaluation

Khoshdel Nikkhoo, Hani

dc.contributor.author	Khoshdel Nikkhoo, Hani
dc.date.accessioned	2011-01-21 16:53:50 (GMT)
dc.date.available	2011-01-21 16:53:50 (GMT)
dc.date.issued	2011-01-21T16:53:50Z
dc.date.submitted	2011-01-18
dc.identifier.uri	http://hdl.handle.net/10012/5750
dc.description.abstract	Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments have not been fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	near-duplicate detection	en
dc.subject	MapReduce	en
dc.subject	shingles	en
dc.title	The Impact of Near-Duplicate Documents on Information Retrieval Evaluation	en
dc.type	Master Thesis	en
dc.pending	false	en
dc.subject.program	Computer Science	en
uws-etd.degree.department	School of Computer Science	en
uws-etd.degree	Master of Mathematics	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: Khoshdel_Nikkhoo_Hani.pdf
Size: 1.420Mb
Format: PDF
