Wednesday, February 5, 2014

Week 5: Reading Notes

IIR Chapter 11:   
    In this chapter, the author mainly talks about using probabilities in information retrieval and introduces several approaches to probabilistic information retrieval. Users start with information needs, which they translate into query representations, and documents are likewise converted into document representations.
    The author first reviews some basic knowledge of probability, most of which we already learned in high school, and then concentrates on the Binary Independence Model, which is the original and still most influential probabilistic retrieval model. Finally, the chapter covers related but extended methods that use term counts, including the empirically successful Okapi BM25 weighting scheme and Bayesian network models for IR. All of these belong to the probabilistic area and come with some mathematical derivations and explanations. The author introduces the partition rule and Bayes' Rule, which is P(A|B) = P(B|A)P(A) / P(B).
    At last, the author mentions a concept named odds. The odds of an event provide a kind of multiplier for how probabilities change when new evidence arrives.
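    As a side note of my own (not a formula copied from the chapter): the odds of an event A are O(A) = P(A) / (1 − P(A)), so multiplying the odds by a likelihood ratio is a convenient way to see how much a piece of evidence shifts a probability.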
    In another section, the author introduces the 1/0 loss case and the quantity P(R = 1|d, q), the probability that a document d is relevant to a query q. This is the basis of the Probability Ranking Principle. If a set of retrieval results is to be returned, rather than an ordering, the Bayes Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant. The Probability Ranking Principle says that if, for a specific document d and for all documents d′ not yet retrieved, P(R = 1|d, q) ≥ P(R = 1|d′, q), then d is the next document to be retrieved.
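    To make the principle concrete, here is a minimal sketch in Python (my own illustration; estimate_relevance_prob is a hypothetical placeholder for any model that returns an estimate of P(R = 1|d, q)):

# Rank documents by their estimated probability of relevance (PRP).
def rank_by_prp(documents, query, estimate_relevance_prob):
    # Score every candidate document with its estimated P(R = 1 | d, q).
    scored = [(estimate_relevance_prob(d, query), d) for d in documents]
    # The PRP says to present documents in decreasing order of that probability.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored]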
    The next section is about the Binary Independence Model. The author makes the conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word. The resulting quantity used for ranking in this model is called the Retrieval Status Value (RSV): RSV_d = Σ over query terms t present in d of log [ p_t (1 − u_t) / ( u_t (1 − p_t) ) ], where p_t is the probability of the term occurring in a relevant document and u_t the probability of it occurring in a nonrelevant one.
    In the next part, the author introduces probability estimates in theory. Estimating a probability as the relative frequency of the event, i.e. the fraction of observed cases in which it occurs, is the maximum likelihood estimate (MLE), because this value makes the observed data maximally likely. To avoid the possibility of zeroes (such as if every or no relevant document contains a particular term), it is fairly standard to add 1/2 to each of the count quantities. The author then introduces the concept of a Bayesian prior; smoothing in this way is a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for the probabilities based on the prior and the observed evidence.
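    To see how these smoothed estimates plug into the RSV, here is a rough Python sketch of my own (not code from the book); documents are represented simply as sets of terms, and the add-1/2 smoothing keeps every probability strictly between 0 and 1:

import math

def bim_rsv(query_terms, doc_terms, relevant_docs, all_docs):
    # relevant_docs and all_docs are lists of term sets; doc_terms is one term set.
    S = len(relevant_docs)   # number of judged relevant documents
    N = len(all_docs)        # total number of documents
    score = 0.0
    for t in query_terms:
        if t not in doc_terms:
            continue  # the BIM RSV only sums over query terms present in the document
        s = sum(1 for d in relevant_docs if t in d)   # relevant docs containing t
        df = sum(1 for d in all_docs if t in d)       # all docs containing t
        # Add-1/2 ("Bayesian prior") smoothed estimates of p_t and u_t.
        p_t = (s + 0.5) / (S + 1.0)            # P(term present | relevant)
        u_t = (df - s + 0.5) / (N - S + 1.0)   # P(term present | nonrelevant)
        # c_t: the log odds ratio used as this term's contribution to the RSV.
        score += math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))
    return score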
    In the part on probability estimates in practice, the author notes that Croft and Harper (1979) proposed using a constant for p_t in their combination match model. Moreover, we can use (pseudo-)relevance feedback, perhaps in an iterative process of estimation, to get a more accurate estimate of p_t.
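    If I follow the Croft and Harper idea correctly (my own summary, not a quote): assuming p_t = 0.5 for every query term makes the p_t factors cancel, and with u_t approximated by df_t / N the term weight reduces to c_t = log [ (1 − u_t) / u_t ] = log [ (N − df_t) / df_t ] ≈ log (N / df_t), which is essentially an idf weight.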
    Probabilistic methods are among the oldest formal models in IR. With tree-structured dependencies between terms, some of the assumptions of the BIM can be removed.

IIR Chapter 12:
    This chapter is mainly about language models for information retrieval. Instead of overtly modeling the probability of relevance as in Chapter 11, the basic language modeling approach builds a probabilistic language model Md from each document d and ranks documents based on the probability of that model generating the query, P(q|Md). In the first part of the chapter, the author introduces the concept of language models and then talks about the query likelihood model. In the end, the author also covers various extensions to the language modeling approach.
    At the beginning of this chapter, the author introduces finite automata and language models. A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings, and a language model is a function that puts a probability measure over strings drawn from some vocabulary. As for the types of language models, the simplest form simply throws away all conditioning context and estimates each term independently; this is called the unigram language model. At the end of the section, the author tells us: "The strategy we adopt in IR is as follows. We pretend that the document d is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, and use that model to calculate the probability of observing any word sequence, and, finally, we rank documents according to their probability of generating the query." That is the strategy of language models in IR.
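    As a toy illustration of the unigram idea, here is a small Python sketch of my own (not code from the book): estimate term probabilities from a document by maximum likelihood, then multiply them to get the probability of a query under that document's model:

from collections import Counter

def unigram_model(document_tokens):
    # Maximum likelihood estimate: P(t | Md) = count of t in d / length of d.
    counts = Counter(document_tokens)
    total = len(document_tokens)
    return {t: c / total for t, c in counts.items()}

def query_probability(query_tokens, model):
    # Unigram assumption: query terms are generated independently,
    # so P(q | Md) is just the product of the individual term probabilities.
    prob = 1.0
    for t in query_tokens:
        prob *= model.get(t, 0.0)  # unseen terms get probability 0 in this naive version
    return prob

# Tiny example: a nine-word "document" and a two-word query.
md = unigram_model("the quick brown fox jumps over the lazy dog".split())
print(query_probability(["quick", "fox"], md))

    The zero probability for unseen terms in this naive version is exactly the problem that smoothing, discussed later in the chapter, is meant to fix.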
    In the next several sections, the author tells us about several kinds of language models. Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model. The most common way to do this is using the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model (page 263), where the documents are the classes, each treated in the estimation as a separate "language".
    In the end, the author concludes that the retrieval ranking for a query q under the basic LM for IR he has been considering is given by
P(d|q) ∝ P(d) · ∏_{t in q} [ λ P(t|Md) + (1 − λ) P(t|Mc) ],
where Mc is a language model estimated from the entire collection, λ is a smoothing parameter between 0 and 1, and P(d) is a prior over documents (often taken to be uniform).
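    A rough sketch of what this scoring could look like in Python (my own illustration; the models are plain maximum likelihood counts and the value of the smoothing weight lam is arbitrary):

import math
from collections import Counter

def query_likelihood_score(query_tokens, doc_tokens, collection_tokens, lam=0.5):
    # Document and collection models as maximum likelihood estimates.
    doc_counts, coll_counts = Counter(doc_tokens), Counter(collection_tokens)
    doc_len, coll_len = len(doc_tokens), len(collection_tokens)
    log_score = 0.0
    for t in query_tokens:
        p_doc = doc_counts[t] / doc_len      # P(t | Md), zero if t is not in d
        p_coll = coll_counts[t] / coll_len   # P(t | Mc) from the whole collection
        # Linear interpolation keeps the score nonzero for query terms missing
        # from the document; it assumes every query term occurs at least once
        # somewhere in the collection.
        log_score += math.log(lam * p_doc + (1 - lam) * p_coll)
    # A uniform document prior P(d) is assumed, so it drops out of the ranking.
    return log_score

    Working in log space, as above, also avoids numerical underflow when queries get long.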
    The next section is about the comparison between language modeling and other approaches in IR. Compared to other probabilistic approaches, such as the BIM from Chapter 11, the main difference initially appears to be that the LM approach does away with explicitly modeling relevance (whereas this is the central variable evaluated in the BIM approach). The model has significant relations to traditional tf-idf models. The author also lists three ways of developing the language modeling approach: query likelihood, document likelihood, and model comparison.
The Paper:
    This paper mainly talks about the comparison between the traditional IR models and the new language model in IR. There are three traditional IR models: the Boolean model, the vector model and the probabilistic model, and the new language model is related to the vector model through its use of tf and idf terms and to the probabilistic model through relevance weighting. The vector model and the probabilistic model stand for different approaches to information retrieval: the former is based on the similarity between query and document, the latter on the probability of relevance, using the distribution of terms over relevant and non-relevant documents. However, the author finds some interesting things in language models. He presents a strong theoretical motivation of the language modelling approach and shows that the approach outperforms the weighting algorithms developed within the traditional models.
   After discussing the features of some traditional models, the author begins to introduce the statistical language model of retrieval. He uses an urn model as a metaphor to illustrate the language modelling idea, and then introduces the ad hoc retrieval task. He uses the traditional models as the foundation, gives a definition of the corresponding probability measures, and describes parameter estimation using tf and idf. In the end, he shows the results on the ad hoc task. The results show that both the original probabilistic model and the original vector space model underperform on this task, while the language model shares some features with the traditional models. The paper also introduces new ways of thinking about two popular information retrieval tools: the use of stop words and the use of a stemmer.
