A Study of Using Chinese Restaurant Process Mixture Models in Information Retrieval

Lin, Wu

dc.contributor.author	Lin, Wu
dc.date.accessioned	2015-01-16 18:18:25 (GMT)
dc.date.available	2015-01-16 18:18:25 (GMT)
dc.date.issued	2015-01-16
dc.date.submitted	2015
dc.identifier.uri	http://hdl.handle.net/10012/9082
dc.description.abstract	Retrieval systems help users isolate relevant information from massive data collections. Usually, a user obtains useful information by submitting a query to such a system. One critical issue is that a query can have many subtopics. The Web query "apple products" is one example: the user may want Web pages about iPhones or about products made from the fruit. Determining which is relevant is difficult without feedback from the user. Query-specific clustering is one approach to discovering the relevant aspects of a query by grouping relevant documents into clusters, so that each cluster represents one relevant aspect of the query. We study Chinese restaurant process mixture models as the clustering algorithms in this approach. To the best of our knowledge, our work is the first to study such models in this context. Classical clustering models such as K-means and K-mixture Gaussian models require the number of clusters, K, to be guessed first, and only then estimate the clusters from data. Chinese restaurant process mixture models can learn the number of clusters and the clusters themselves from data simultaneously. This thesis first reviews K-means, K-mixture Gaussian models and Bayesian K-mixture models. We then review Chinese restaurant process mixture models, which extend the Bayesian models so that K is not required to be finite. Among these mixture models, we focus on distance-dependent Chinese restaurant process mixture models, since they can incorporate external pairwise measures. We then propose two similarity-like measures for use with the Chinese restaurant process mixture models in information retrieval. Finally, we review a Gibbs sampling scheme for both types of models and test the models' performance experimentally on the task of pseudo-relevance feedback via query expansion.
In this task, the top-retrieved documents are treated as relevant documents; here we use a collection of documents from the Robust track of TREC 2004. We investigate the effectiveness of these Chinese restaurant process mixture models on three query sets, each containing 50 queries with relevance judgments. To confirm the robustness of these models, we conduct a sensitivity analysis of the hyper-parameters. Results show that the Chinese restaurant process mixture models perform better than the baseline models used in the feedback task and are not sensitive to their hyper-parameters when these are reasonably selected. The proposed measures used in the distance-dependent Chinese restaurant process mixture models perform comparably. However, the proposed measures barely help these models outperform the standard Chinese restaurant process mixture models.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	Dirichlet process	en
dc.subject	Chinese Restaurant Process	en
dc.subject	Gibbs sampling	en
dc.subject	Mixture Models	en
dc.subject	Information Retrieval	en
dc.title	A Study of Using Chinese Restaurant Process Mixture Models in Information Retrieval	en
dc.type	Master Thesis	en
dc.pending	false
dc.subject.program	Statistics	en
uws-etd.degree.department	Statistics and Actuarial Science	en
uws-etd.degree	Master of Mathematics	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en

Files in this item

Name: Lin_Wu.pdf
Size: 539.7Kb
Format: PDF
