Simple Yet Effective Pseudo Relevance Feedback with Rocchio’s Technique and Text Classification

Liu, Yuqi

dc.contributor.author	Liu, Yuqi
dc.date.accessioned	2022-08-22 19:17:33 (GMT)
dc.date.available	2022-08-22 19:17:33 (GMT)
dc.date.issued	2022-08-22
dc.date.submitted	2022-08-12
dc.identifier.uri	http://hdl.handle.net/10012/18603
dc.description.abstract	With the continuous growth of the Internet and the availability of large-scale collections, assisting users in locating the information they need has become a necessity. Generally, an information retrieval system processes an input query and returns a ranked list of results. However, this process can be challenging due to the "vocabulary mismatch" between input queries and passages. A well-known technique to address this issue is "query expansion", which reformulates the given query by selecting and adding more relevant terms. Relevance feedback, a form of query expansion, collects users' judgments on candidate passages and expands the query with terms from the relevant ones. Pseudo relevance feedback assumes that the top documents from the initial retrieval are relevant and rebuilds the query without any user interaction. In this thesis, we discuss two implementations of pseudo relevance feedback: the decades-old Rocchio's Technique and the more recent text classification approach. As the reader might notice, neither technique is "novel" anymore; Rocchio's Technique, for example, dates back to the 1960s. Both were proposed and studied before the neural age, when texts were still mostly stored as bag-of-words representations. Today, transformers have been shown to advance information retrieval, and searching with transformer-based dense representations outperforms traditional bag-of-words searching on many challenging and complex ranking tasks. This motivates us to ask the following three research questions: RQ1: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback with Rocchio's Technique still perform effectively with both sparse and dense representations? RQ2: Given strong baselines, large labelled datasets, and the emergence of transformers today, does pseudo relevance feedback via text classification still perform effectively with both sparse and dense representations? 
RQ3: Does applying pseudo relevance feedback with text classification on top of Rocchio's Technique result in further improvements? To answer RQ1, we have implemented Rocchio's Technique with sparse representations based on the Anserini and Pyserini toolkits. Building on a previous implementation of Rocchio's Technique with dense representations in the Pyserini toolkit, we can easily evaluate and compare the impact of Rocchio's Technique on effectiveness with both sparse and dense representations. By applying Rocchio's Technique to MS MARCO Passage and Document TREC Deep Learning topics, we achieve an increase of about 0.03-0.04 in average precision. It’s no surprise that Rocchio's Technique outperforms the BM25 baseline, but it is impressive to find that it is competitive with or even superior to RM3, a more common strong baseline, under most circumstances. Hence, we propose switching to Rocchio's Technique as a more robust and general baseline in future studies. To our knowledge, pseudo relevance feedback via text classification using both positive and negative labels was not well studied before our work. To answer RQ2, we have verified the effectiveness of pseudo relevance feedback via text classification with both sparse and dense representations. Three classifiers (LR, SVM, KNN) are trained, and all enhance effectiveness. We also observe that pseudo relevance feedback via text classification yields greater improvement with dense representations than with sparse ones. However, when we compare text classification to Rocchio's Technique, we find that Rocchio's Technique is superior to pseudo relevance feedback via text classification under all circumstances. For RQ3, the success of pseudo relevance feedback via text classification on BM25 + RM3 across four newswire collections in our previous paper motivates us to study the impact of pseudo relevance feedback via text classification on top of another query expansion result, Rocchio's Technique. 
However, unlike with RM3, we could not observe much difference in the two evaluation metrics after applying pseudo relevance feedback via text classification on top of Rocchio's Technique. This work aims to explore some simple yet effective techniques that might be overlooked in light of deep learning transformers. Instead of pursuing "more", we aim to find out something "less". We demonstrate the robustness and effectiveness of some "out-of-date" methods in the age of neural networks.	en
dc.language.iso	en	en
dc.publisher	University of Waterloo	en
dc.subject	information retrieval	en
dc.subject	pseudo relevance feedback	en
dc.subject	query expansion	en
dc.subject	text classification	en
dc.title	Simple Yet Effective Pseudo Relevance Feedback with Rocchio’s Technique and Text Classification	en
dc.type	Master Thesis	en
dc.pending	false
uws-etd.degree.department	Data Science	en
uws-etd.degree.discipline	Data Science	en
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.degree	Master of Mathematics	en
uws-etd.embargo.terms	0	en
uws.contributor.advisor	Lin, Jimmy
uws.contributor.affiliation1	Faculty of Mathematics	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.typeOfResource	Text	en
uws.peerReviewStatus	Unreviewed	en
uws.scholarLevel	Graduate	en
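
The pseudo relevance feedback step described in the abstract can be illustrated with a minimal sketch of Rocchio-style query reweighting over sparse term-weight vectors. This is not the Anserini or Pyserini implementation used in the thesis: the function name `rocchio_expand`, the parameters `top_k`, `bottom_k`, `alpha`, `beta`, and `gamma`, and their default values are all assumptions chosen for illustration.

```python
from collections import defaultdict


def rocchio_expand(query_vec, ranked_doc_vecs, top_k=3, bottom_k=0,
                   alpha=1.0, beta=0.75, gamma=0.0):
    """Reweight a query using pseudo relevance feedback (illustrative only).

    query_vec:       dict mapping term -> weight for the original query
    ranked_doc_vecs: list of dicts (term -> weight), best-ranked first
    top_k:           number of top documents assumed to be relevant
    bottom_k:        number of bottom documents assumed non-relevant (0 = none)
    alpha/beta/gamma: hypothetical weights, not values taken from the thesis
    """
    positives = ranked_doc_vecs[:top_k]
    negatives = ranked_doc_vecs[-bottom_k:] if bottom_k > 0 else []

    new_query = defaultdict(float)

    # Keep the original query terms, scaled by alpha.
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight

    # Add the mean of the assumed-relevant documents (scaled by beta) and
    # subtract the mean of the assumed-non-relevant documents (scaled by gamma).
    for docs, scale in ((positives, beta), (negatives, -gamma)):
        for doc in docs:
            for term, weight in doc.items():
                new_query[term] += scale * weight / len(docs)

    # Terms whose weight is not positive are dropped from the new query.
    return {term: w for term, w in new_query.items() if w > 0}


if __name__ == "__main__":
    query = {"cheap": 1.0, "flights": 1.0}
    ranked = [
        {"flights": 1.0, "airfare": 1.0},
        {"flights": 1.0, "tickets": 1.0},
        {"hotel": 1.0},
    ]
    print(rocchio_expand(query, ranked, top_k=2, beta=0.5))
```

In this toy run the terms "airfare" and "tickets" enter the query because they occur in the top two documents, which is how expansion of this kind can reduce vocabulary mismatch. The same arithmetic applies to dense representations by replacing the dictionaries with fixed-length vectors.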

Files in this item

Name: Liu_Yuqi.pdf
Size: 620.8Kb
Format: PDF
