Show simple item record

dc.contributor.author: Khoshdel Nikkhoo, Hani
dc.date.accessioned: 2011-01-21 16:53:50 (GMT)
dc.date.available: 2011-01-21 16:53:50 (GMT)
dc.date.issued: 2011-01-21T16:53:50Z
dc.date.submitted: 2011-01-18
dc.identifier.uri: http://hdl.handle.net/10012/5750
dc.description.abstract: Near-duplicate documents can adversely affect the efficiency and effectiveness of search engines. Due to the pairwise nature of the comparisons required for near-duplicate detection, this process is extremely costly in terms of the time and processing power it requires. Despite the ubiquitous presence of near-duplicate detection algorithms in commercial search engines, their application and impact in research environments have not been fully explored. The implementation of near-duplicate detection algorithms forces trade-offs between efficiency and effectiveness, entailing careful testing and measurement to ensure acceptable performance. In this thesis, we describe and evaluate a scalable implementation of a near-duplicate detection algorithm, based on standard shingling techniques, running under a MapReduce framework. We explore two different shingle sampling techniques and analyze their impact on the near-duplicate document detection process. In addition, we investigate the prevalence of near-duplicate documents in the runs submitted to the ad hoc task of the TREC 2009 Web Track. [en]
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: near-duplicate detection [en]
dc.subject: MapReduce [en]
dc.subject: shingles [en]
dc.title: The Impact of Near-Duplicate Documents on Information Retrieval Evaluation [en]
dc.type: Master Thesis [en]
dc.pending: false [en]
dc.subject.program: Computer Science [en]
uws-etd.degree.department: School of Computer Science [en]
uws-etd.degree: Master of Mathematics [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]
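
The abstract refers to standard shingling and to sampling of shingles. As a rough illustration only, the sketch below builds word-level shingles, computes the Jaccard resemblance of two shingle sets, and keeps a fixed-size sample of shingle hashes. The function names, the shingle width `w`, the sample size `s`, the SHA-1 hash, and the "keep the smallest hashes" sampling rule are all assumptions chosen for this example; the record does not say which sampling techniques or parameters the thesis uses, and the MapReduce distribution of the work is not shown.

```python
import hashlib


def shingles(text, w=4):
    """Return the set of w-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < w:
        # Short texts yield a single shingle holding all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b):
    """Jaccard resemblance |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def shingle_hash(shingle):
    """Deterministic 64-bit hash of a shingle (hash choice is illustrative)."""
    digest = hashlib.sha1(" ".join(shingle).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sample_smallest(shingle_set, s=8):
    """Keep the s smallest shingle hashes as a fixed-size sample.

    This is one common sampling rule, assumed here for illustration.
    """
    return set(sorted(shingle_hash(sh) for sh in shingle_set)[:s])


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"
score = resemblance(shingles(doc1), shingles(doc2))
```

With `w=4`, each of the two example sentences produces six shingles, five of which are shared, so `score` is 5/7. Comparing the fixed-size samples instead of the full sets trades some accuracy for less data per document, which is the kind of efficiency versus effectiveness trade-off the abstract mentions.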


All items in UWSpace are protected by copyright, with all rights reserved.
