2. Most recent search engines support a double-quotes syntax (“stanford university”) for phrase queries, which users have found easy to understand and use successfully.
3. The concept of a biword index can be extended to longer sequences of words, and if the index includes variable length word sequences, it is generally referred to as a phrase index.
4. For the reasons given, a biword index is not the standard solution. Rather, a positional index is most commonly employed.
5. Let’s examine the space implications of having a positional index. A posting now needs an entry for each occurrence of a term. The index size thus depends on the average document size. The average web page has fewer than 1000 terms, but documents like SEC stock filings, books, and even some epic poems easily reach 100,000 terms. Consider a term that occurs, on average, once per 1000 terms. The result is that large documents cause an increase of two orders of magnitude in the space required to store the postings list:
Document size    Expected postings    Expected entries in positional posting
1000             1                    1
100,000          1                    100
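The arithmetic behind the table can be checked with a small sketch. It is only an illustration under stated assumptions: the function names (`build_positional_index`, `posting_counts`, `expected_positional_entries`) and the toy documents are hypothetical, and tokenization is plain whitespace splitting. It shows that a non-positional posting costs one entry per document, while a positional posting costs one entry per occurrence.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text.

    Tokenization is a naive lowercase whitespace split (an assumption).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def posting_counts(index, term):
    """Return (documents containing term, total positional entries)."""
    postings = index.get(term, {})
    return len(postings), sum(len(positions) for positions in postings.values())


def expected_positional_entries(doc_size, terms_per_occurrence=1000):
    """Expected occurrences of a term that appears once per
    `terms_per_occurrence` terms, in a document of `doc_size` terms."""
    return doc_size / terms_per_occurrence


# Toy collection (hypothetical data, for illustration only).
docs = {
    1: "to be or not to be",
    2: "be quick",
}
index = build_positional_index(docs)

print(posting_counts(index, "be"))           # (2, 3): 2 documents, 3 occurrences
print(expected_positional_entries(1000))     # 1.0
print(expected_positional_entries(100_000))  # 100.0
```

For the term "be", the postings list names two documents but the positional index stores three positions; scaling the same effect up to a 100,000-term document gives the factor of 100 in the table.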