Javed A. Aslam
Harriet Fell, Rajmohan Rajaraman, Avi Arampatzis
Date of Award
Doctor of Philosophy
Department or Academic Unit
College of Computer and Information Science
evaluation, information retrieval, relevance, score distributions
Computer Sciences | Databases and Information Systems | Theory and Algorithms
When a user submits a query to a search engine, the search engine computes a score for each document according to its relevance to the query and ranks the documents by their scores. Because modern search engines are complex, the raw score alone is not sufficient for information retrieval applications that must combine different ranked lists. Inferring the score distributions of relevant and non-relevant documents, and estimating the probability of relevance, therefore become imperative. In this thesis, we address two major research questions: (1) How can score distributions for relevant and non-relevant documents be modeled more accurately? (2) How can score distributions be better inferred in practice when relevance information is absent?
In the first part of the thesis, we identify problems with today's most widely used score distribution model and propose modeling the relevant document scores by a mixture of Gaussian distributions and the non-relevant scores by a Gamma distribution. We then model score distributions more systematically. Starting from a basic assumption about the distribution of terms in a document, the distribution of the scores produced for retrieved documents can be derived through the transformations applied to the term frequency. The score distribution of relevant documents can in turn be derived, through a general mathematical framework, from the score distribution of all retrieved documents.
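As a rough illustration of how such a model yields a probability of relevance, the following minimal sketch applies Bayes' rule to a Gaussian-mixture density for relevant scores and a Gamma density for non-relevant scores. The function names, the prior, and every parameter value are hypothetical and chosen only for illustration; they are not taken from the thesis, which estimates its parameters from data.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def gamma_pdf(x, shape, scale):
    """Density of a Gamma distribution; zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)


def relevant_pdf(x, components):
    """Mixture of Gaussians; components is a list of (weight, mean, std)."""
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)


def prob_relevant(score, prior, components, shape, scale):
    """Posterior probability of relevance for a score, via Bayes' rule."""
    rel = prior * relevant_pdf(score, components)
    non = (1.0 - prior) * gamma_pdf(score, shape, scale)
    total = rel + non
    return rel / total if total > 0.0 else 0.0


# Hypothetical parameters, for illustration only.
COMPONENTS = [(0.6, 6.0, 1.0), (0.4, 9.0, 1.5)]  # relevant: 2 Gaussians
PRIOR = 0.1                                      # assumed share of relevant docs
SHAPE, SCALE = 2.0, 1.0                          # non-relevant: Gamma

if __name__ == "__main__":
    for s in (1.0, 4.0, 8.0):
        p = prob_relevant(s, PRIOR, COMPONENTS, SHAPE, SCALE)
        print(f"score={s:.1f}  P(relevant | score)={p:.4f}")
```

Under these assumed parameters the posterior is low for small scores, where the Gamma component dominates, and high for large scores, which is what makes scores from different systems comparable on a common probability scale.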
The second part of the thesis presents a new framework for inferring score distributions when relevance information is unavailable. The new inference process extends the expectation-maximization algorithm by simultaneously considering the ranked lists of documents returned by multiple retrieval systems, and it encodes the constraint that a document retrieved by multiple systems should have a single, global probability of relevance. We demonstrate that the resulting framework is more effective when applied to the task of metasearch.
Dai, Keshi, "Modeling score distributions for information retrieval" (2012). Computer Science Dissertations. Paper 16. http://hdl.handle.net/2047/d20002670