dc.contributor.author: Lin, Wu
dc.date.accessioned: 2015-01-16 18:18:25 (GMT)
dc.date.available: 2015-01-16 18:18:25 (GMT)
dc.date.issued: 2015-01-16
dc.date.submitted: 2015
dc.identifier.uri: http://hdl.handle.net/10012/9082
dc.description.abstract: Retrieval systems help users isolate relevant information from massive data collections. Usually, a user obtains useful information by submitting a query to such a system. One critical issue is that a query can have many subtopics. The Web query "apple products" is one example: the user may want Web pages about iPhones or about products made from the fruit "apple". Determining which is relevant is difficult without feedback from the user. Query-specific clustering is one approach used to discover relevant aspects of a query by grouping relevant documents into clusters. In this approach, each cluster represents a relevant aspect of the query. We study Chinese restaurant process mixture models as clustering algorithms in this approach. To the best of our knowledge, our work is the first to study such models in this context. Classical clustering models such as K-means and K-mixture Gaussian models must first guess the number of clusters, K, and then estimate the clusters from data. Chinese restaurant process mixture models can learn the number of clusters and the clusters themselves from data simultaneously. This thesis first reviews K-means, K-mixture Gaussian models and Bayesian K-mixture models. We then review Chinese restaurant process mixture models, which extend the Bayesian models to the case where K is not required to be finite. Among these mixture models, we focus on distance-dependent Chinese restaurant process mixture models, since they allow external pairwise measures to be used in modeling. We then propose two similarity-like measures for use with the Chinese restaurant process mixture models in information retrieval. Finally, we review a Gibbs sampling scheme for both types of models. We test the models' performance experimentally on the task of pseudo-relevance feedback via query expansion. In this task, the top-retrieved documents are treated as relevant documents; here we use a collection of documents from the Robust track of TREC 2004. We investigate the effectiveness of these Chinese restaurant process mixture models on three query sets, each of which contains 50 queries and relevance judgments. To confirm the robustness of these models, we conduct a sensitivity analysis of the hyper-parameters. Results show that the Chinese restaurant process mixture models perform better than the baseline models used in the feedback task, and are not sensitive to their hyper-parameters when these are reasonably selected. The proposed measures, used in the distance-dependent Chinese restaurant process mixture models, perform comparably. On the other hand, the proposed measures barely help these models outperform the standard Chinese restaurant process mixture models. [en]
dc.language.iso: en [en]
dc.publisher: University of Waterloo [en]
dc.subject: Dirichlet process [en]
dc.subject: Chinese Restaurant Process [en]
dc.subject: Gibbs sampling [en]
dc.subject: Mixture Models [en]
dc.subject: Information Retrieval [en]
dc.title: A Study of Using Chinese Restaurant Process Mixture Models in Information Retrieval [en]
dc.type: Master Thesis [en]
dc.pending: false
dc.subject.program: Statistics [en]
uws-etd.degree.department: Statistics and Actuarial Science [en]
uws-etd.degree: Master of Mathematics [en]
uws.typeOfResource: Text [en]
uws.peerReviewStatus: Unreviewed [en]
uws.scholarLevel: Graduate [en]


UWSpace

All items in UWSpace are protected by copyright, with all rights reserved.
