Advisor(s)

Javed A. Aslam

Contributor(s)

Harriet Fell, Rajmohan Rajaraman, Avi Arampatzis

Date of Award

4-2012

Date Accepted

4-26-2012

Degree Grantor

Northeastern University

Degree Level

Ph.D.

Degree Name

Doctor of Philosophy

Department or Academic Unit

College of Computer and Information Science

Keywords

evaluation, information retrieval, relevance, score distributions

Disciplines

Computer Sciences | Databases and Information Systems | Theory and Algorithms

Abstract

When a user submits a query to a search engine, the engine computes a score for each document according to its relevance to the query and ranks the documents by score. Because modern search engines are complex, the raw score alone is not sufficient for information retrieval applications that must combine different ranked lists. It therefore becomes imperative to infer the score distributions of relevant and non-relevant documents and to estimate the probability of relevance. In this thesis, we address two major research questions: (1) How can the score distributions of relevant and non-relevant documents be modeled more accurately? (2) How can score distributions be better inferred in practice when relevance information is absent?

In the first part of the thesis, we show the problems with today's most widely used score distribution model and propose modeling relevant document scores with a mixture of Gaussian distributions and non-relevant scores with a Gamma distribution. We then model score distributions in a more systematic manner. Starting from a basic assumption about the distribution of terms in a document, the distribution of the scores produced for retrieved documents can be derived through the transformations applied to the term frequency. The score distribution of relevant documents can in turn be derived, through a general mathematical framework, from the score distribution of all retrieved documents.
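To make the Gaussian-mixture/Gamma model concrete, the following minimal sketch shows how two fitted score densities could be turned into a probability of relevance with Bayes' rule. It is an illustration only, not code from the thesis: the function names and every parameter value (mixture weights, means, standard deviations, Gamma shape and scale, and the prior) are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def gamma_pdf(x, shape, scale):
    """Density of a Gamma distribution at x (zero for x <= 0)."""
    if x <= 0.0:
        return 0.0
    log_pdf = ((shape - 1.0) * math.log(x) - x / scale
               - math.lgamma(shape) - shape * math.log(scale))
    return math.exp(log_pdf)

def relevant_density(score, components):
    """Mixture of Gaussians; components is a list of (weight, mean, std)."""
    return sum(w * gaussian_pdf(score, m, s) for w, m, s in components)

def prob_relevant(score, prior, components, shape, scale):
    """Posterior P(relevant | score) via Bayes' rule."""
    rel = prior * relevant_density(score, components)
    non = (1.0 - prior) * gamma_pdf(score, shape, scale)
    total = rel + non
    return rel / total if total > 0.0 else 0.0

# Hypothetical parameters, chosen only for illustration.
REL_COMPONENTS = [(0.6, 6.0, 1.0), (0.4, 9.0, 1.5)]  # relevant: 2 Gaussians
NONREL_SHAPE, NONREL_SCALE = 2.0, 1.0                # non-relevant: Gamma
PRIOR = 0.1                                          # assumed P(relevant)

if __name__ == "__main__":
    for s in (2.0, 5.0, 8.0):
        p = prob_relevant(s, PRIOR, REL_COMPONENTS, NONREL_SHAPE, NONREL_SCALE)
        print(f"score={s:.1f}  P(relevant | score)={p:.4f}")
```

With these assumed parameters, a high score is assigned a much larger probability of relevance than a low one, which is the property that lets scores from different systems be compared on a common scale.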

The second part of the thesis presents a new framework for inferring score distributions when relevance information is unavailable. The new inference process extends the expectation maximization algorithm by simultaneously considering the ranked lists of documents returned by multiple retrieval systems, and it encodes the constraint that a document retrieved by multiple systems should have the same global probability of relevance. We demonstrate that the combined approach is more effective when applied to the task of metasearch.
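The shared-probability constraint can be illustrated with a small sketch of a single expectation step in which each document receives one probability of relevance pooled over every system that retrieved it. This is not the thesis's inference procedure. It assumes, purely for illustration, that scores are conditionally independent across systems, that each system's relevant and non-relevant scores follow single Gaussians, and that a system which did not retrieve a document contributes no evidence. All names and numbers are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log-density of a normal distribution at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def shared_relevance(doc_scores, models, prior):
    """One global P(relevant) for a document scored by several systems.

    doc_scores: {system: score} for the systems that retrieved the document.
    models: {system: ((mean_rel, std_rel), (mean_non, std_non))}.
    Assumes scores are conditionally independent across systems.
    """
    log_odds = math.log(prior) - math.log(1.0 - prior)
    for system, score in doc_scores.items():
        (m_rel, s_rel), (m_non, s_non) = models[system]
        log_odds += gaussian_logpdf(score, m_rel, s_rel)
        log_odds -= gaussian_logpdf(score, m_non, s_non)
    # Numerically stable logistic function of the pooled log-odds.
    if log_odds >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def fuse(docs, models, prior):
    """Metasearch ranking: sort documents by the shared probability."""
    probs = {d: shared_relevance(s, models, prior) for d, s in docs.items()}
    return sorted(probs, key=probs.get, reverse=True), probs

# Hypothetical per-system models and score lists.
MODELS = {
    "A": ((7.0, 1.5), (3.0, 1.5)),
    "B": ((0.8, 0.1), (0.4, 0.15)),
}
DOCS = {
    "d1": {"A": 7.5, "B": 0.85},  # scored highly by both systems
    "d2": {"A": 7.0},             # retrieved by system A only
    "d3": {"A": 3.2, "B": 0.45},  # scored poorly by both systems
}

if __name__ == "__main__":
    ranking, probs = fuse(DOCS, MODELS, prior=0.2)
    for d in ranking:
        print(d, round(probs[d], 4))
```

In a full expectation maximization loop, these shared probabilities would then be used as weights to re-estimate each system's distribution parameters, and the two steps would alternate until convergence.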

Document Type

Dissertation

Rights Holder

Keshi Dai


