dc.contributor.author | Khoshdel Nikkhoo, Hani | |
dc.date.accessioned | 2011-01-21 16:53:50 (GMT) | |
dc.date.available | 2011-01-21 16:53:50 (GMT) | |
dc.date.issued | 2011-01-21T16:53:50Z | |
dc.date.submitted | 2011-01-18 | |
dc.identifier.uri | http://hdl.handle.net/10012/5750 | |
dc.description.abstract | Near-duplicate documents can adversely affect the efficiency and
effectiveness of search engines.
Due to the pairwise nature of the comparisons required for near-duplicate
detection, this process is extremely costly in terms of the time and
processing power it requires.
Despite the ubiquitous presence of near-duplicate detection algorithms
in commercial search engines, their application and impact in research
environments have not been fully explored.
The implementation of near-duplicate detection algorithms forces trade-offs
between efficiency and effectiveness, entailing careful testing and
measurement to ensure acceptable performance.
In this thesis, we describe and evaluate a scalable implementation of a
near-duplicate detection algorithm, based on standard shingling techniques,
running under a MapReduce framework.
We explore two different shingle sampling techniques and analyze
their impact on the near-duplicate document detection process.
In addition, we investigate the prevalence of near-duplicate documents
in the runs submitted to the ad hoc task of the TREC 2009 Web Track. | en |
dc.language.iso | en | en |
dc.publisher | University of Waterloo | en |
dc.subject | near-duplicate detection | en |
dc.subject | MapReduce | en |
dc.subject | shingles | en |
dc.title | The Impact of Near-Duplicate Documents on Information Retrieval Evaluation | en |
dc.type | Master Thesis | en |
dc.pending | false | en |
dc.subject.program | Computer Science | en |
uws-etd.degree.department | School of Computer Science | en |
uws-etd.degree | Master of Mathematics | en |
uws.typeOfResource | Text | en |
uws.peerReviewStatus | Unreviewed | en |
uws.scholarLevel | Graduate | en |
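
The abstract refers to "standard shingling techniques" and to shingle sampling without giving details in this record. The following is a minimal sketch of the general idea, not the thesis's implementation. The function names, the shingle width `k`, the sample size `s`, the MD5 hash, and the "keep the smallest hashes" sampling rule are all assumptions chosen for illustration, and the MapReduce side is not shown.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text.

    k=3 is an illustrative assumption, not a value taken from the thesis.
    """
    words = text.lower().split()
    if len(words) < k:
        # Documents shorter than k words yield one shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets: |A & B| / |A | B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sample_smallest(shingle_set, s=4):
    """Keep the s shingles with the smallest hash values.

    This is one possible sampling scheme, assumed here for illustration;
    the record does not say which two schemes the thesis compares.
    """
    def digest(shingle):
        return hashlib.md5(" ".join(shingle).encode("utf-8")).hexdigest()

    return set(sorted(shingle_set, key=digest)[:s])
```

As a usage example, `resemblance(shingles("the quick brown fox jumps over the lazy dog"), shingles("the quick brown fox jumps over the lazy cat"))` evaluates to 0.75, because the two documents share six of eight distinct 3-word shingles. Comparing sampled sets from `sample_smallest` instead of full sets trades some accuracy for less work per pair, which is the kind of efficiency versus effectiveness trade-off the abstract mentions.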