Sunday, August 30, 2015

Chapter 2 Vocabulary

1. Postings skip pointers. For a postings list of length P, use √P evenly spaced skip pointers. This heuristic can be improved upon; it ignores any details of the distribution of query terms.
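To make the heuristic concrete, here is a minimal sketch of intersecting two sorted postings lists with skip pointers. It assumes the lists are plain Python lists of unique, sorted document IDs, and that a skip pointer sits at every index that is a multiple of floor(√P). The function names and sample lists are illustrative only, not taken from the chapter.

```python
import math


def _advance(postings, idx, skip, target):
    """Move idx forward: follow skip pointers while they do not overshoot
    target, otherwise step to the next posting."""
    def can_skip(i):
        # A skip pointer exists only at multiples of `skip` (assumption).
        return (i % skip == 0
                and i + skip < len(postings)
                and postings[i + skip] <= target)

    if can_skip(idx):
        while can_skip(idx):
            idx += skip
        return idx
    return idx + 1


def intersect_with_skips(p1, p2):
    """Intersect two sorted postings lists using sqrt(P) evenly spaced skips."""
    skip1 = max(1, math.isqrt(len(p1)))
    skip2 = max(1, math.isqrt(len(p2)))
    i = j = 0
    answer = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i = _advance(p1, i, skip1, p2[j])
        else:
            j = _advance(p2, j, skip2, p1[i])
    return answer


if __name__ == "__main__":
    a = [2, 4, 8, 16, 19, 23, 28, 43]
    b = [1, 2, 3, 5, 8, 41, 51, 60, 71]
    print(intersect_with_skips(a, b))
```

A skip is taken only when the posting it lands on is still no larger than the current posting in the other list, so no match can be jumped over.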

2. Most recent search engines support a double quotes syntax (“stanford university”) for phrase queries, which has proven to be very easily understood and successfully used by users.
 
3. The concept of a biword index can be extended to longer sequences of words, and if the index includes variable length word sequences, it is generally referred to as a phrase index.
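As a rough illustration of the biword idea, the sketch below indexes every pair of consecutive words and answers a longer phrase query by ANDing its biwords. Whitespace tokenization, the function names, and the toy documents are all assumptions made for the example.

```python
from collections import defaultdict


def build_biword_index(docs):
    """Map each pair of consecutive tokens to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for first, second in zip(tokens, tokens[1:]):
            index[(first, second)].add(doc_id)
    return index


def biword_phrase_query(index, phrase):
    """Return docs containing every biword of the phrase (may over-match)."""
    tokens = phrase.lower().split()
    biwords = list(zip(tokens, tokens[1:]))
    if not biwords:
        return set()
    result = set(index.get(biwords[0], set()))
    for biword in biwords[1:]:
        result &= index.get(biword, set())
    return result


if __name__ == "__main__":
    docs = {
        1: "stanford university palo alto",
        2: "university of stanford",
        3: "palo alto stanford university",
        4: "university palo alto stanford university",
    }
    index = build_biword_index(docs)
    print(biword_phrase_query(index, "stanford university"))
    print(biword_phrase_query(index, "stanford university palo alto"))
```

In this toy data, document 4 contains all three biwords of "stanford university palo alto" but not the phrase itself, so the second query returns it as a false positive alongside document 1.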
 
4. For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed.
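A minimal sketch of a positional index and a phrase query over it follows, assuming whitespace tokenization and a nested-dictionary layout (term → doc ID → list of positions). The names and sample documents are hypothetical, chosen only to show the position check.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> doc ID -> list of token positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index


def phrase_query(index, phrase):
    """Return docs where the phrase tokens occur at consecutive positions."""
    tokens = phrase.lower().split()
    if not tokens or any(t not in index for t in tokens):
        return set()
    # Candidate docs must contain every term of the phrase.
    candidates = set(index[tokens[0]])
    for t in tokens[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in tokens[1:]]
        for start in index[tokens[0]][doc_id]:
            # The k-th following token must sit exactly k positions later.
            if all(start + k + 1 in positions
                   for k, positions in enumerate(later)):
                hits.add(doc_id)
                break
    return hits


if __name__ == "__main__":
    docs = {
        1: "stanford university palo alto",
        2: "university of stanford",
        3: "palo alto stanford university",
        4: "university palo alto stanford university",
    }
    index = build_positional_index(docs)
    print(phrase_query(index, "stanford university palo alto"))
```

Because positions are checked explicitly, a document that merely contains all the words, in some other order, is not returned.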
 
5. Let’s examine the space implications of having a positional index. A posting now needs an entry for each occurrence of a term. The index size thus depends on the average document size. The average web page has less than 1000 terms, but documents like SEC stock filings, books, and even some epic poems easily reach 100,000 terms. Consider a term with frequency 1 in 1000 terms on average. The result is that large documents cause an increase of two orders of magnitude in the space required to store the postings list:
Document size    Expected postings    Expected entries in positional posting
1000             1                    1
100,000          1                    100
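The arithmetic behind these figures can be checked with a short sketch. It assumes the term occurs once per 1000 tokens on average, and it simplifies "expected postings" to at most one posting per document for a non-positional index; the function name is made up for the example.

```python
def expected_entries(doc_size, tokens_per_occurrence=1000):
    """Return (postings, positional entries) expected for one document.

    Assumption: the term appears once every `tokens_per_occurrence` tokens,
    and a non-positional index stores at most one posting per document.
    """
    positional = doc_size / tokens_per_occurrence
    postings = min(1, positional)
    return postings, positional


if __name__ == "__main__":
    for size in (1000, 100_000):
        postings, positional = expected_entries(size)
        print(size, postings, positional)
```

The positional count grows from 1 to 100 between the two document sizes while the plain posting count stays at 1, which is the factor of 100 (two orders of magnitude) mentioned in the notes.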
