In this chapter, the author focuses on index construction, especially the inverted index. Blocked sort-based indexing (BSBI) is introduced first, followed by distributed indexing and dynamic indexing.
Hardware is a basic consideration when building an information retrieval system. Operating systems generally read and write entire blocks, and a buffer is the part of main memory where blocks being read or written are held. Because main memory is limited, we need an external sorting algorithm. Blocked sort-based indexing (BSBI) is an efficient external-sort approach to index construction; the author then illustrates how BSBI processes a collection.
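The BSBI idea (sort fixed-size blocks of termID–docID pairs, then merge the sorted runs) can be sketched in memory as follows. This is a toy illustration, not the book's disk-based implementation: the "runs" are Python lists rather than files on disk, and `heapq.merge` stands in for the external multi-way merge.

```python
import heapq

def bsbi_index(pairs, block_size):
    """Toy BSBI: sort fixed-size blocks of (termID, docID) pairs,
    then merge the sorted runs into one postings list per term."""
    # Step 1: split the pair stream into blocks and sort each block
    # (simulating the parse-invert-write-to-disk phase).
    runs = []
    for i in range(0, len(pairs), block_size):
        runs.append(sorted(pairs[i:i + block_size]))
    # Step 2: k-way merge of the sorted runs (simulating the disk merge).
    index = {}
    for term_id, doc_id in heapq.merge(*runs):
        postings = index.setdefault(term_id, [])
        if not postings or postings[-1] != doc_id:  # skip duplicates
            postings.append(doc_id)
    return index
```

Because every run is sorted, the merge phase only ever needs one "block" of each run in memory at a time, which is what makes the algorithm work when the full pair list does not fit in main memory.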
However, BSBI does not scale well to very large collections, so a method called single-pass in-memory indexing (SPIMI) is introduced. A key difference between BSBI and SPIMI is that SPIMI adds each posting directly to its postings list. Instead of first collecting all termID–docID pairs and then sorting them (as in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows) and is immediately available to collect postings.
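A minimal sketch of the SPIMI idea, with simplifying assumptions: "memory" is modeled as a cap on the number of distinct terms per block, blocks are kept in a Python list instead of being written to disk, and the function names are mine, not the book's.

```python
def spimi_invert(token_stream, max_terms):
    """Toy SPIMI: build each block's dictionary on the fly, appending
    postings directly to growing lists; flush a block when 'memory'
    (here: the number of distinct terms) is exhausted."""
    blocks, dictionary = [], {}
    for term, doc_id in token_stream:
        if term not in dictionary and len(dictionary) >= max_terms:
            blocks.append(dictionary)   # "write block to disk"
            dictionary = {}
        postings = dictionary.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)     # posting added immediately, no sort
    if dictionary:
        blocks.append(dictionary)
    return blocks

def merge_blocks(blocks):
    """Final merge of the per-block indexes into one index."""
    merged = {}
    for block in blocks:
        for term, postings in block.items():
            merged.setdefault(term, []).extend(postings)
    for postings in merged.values():
        postings.sort()
    return merged
```

Note that no global sort of termID–docID pairs ever happens: each posting goes straight onto its list, which is the source of SPIMI's speed advantage over BSBI.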
Moreover, because collections are often so large, a single machine cannot perform index construction efficiently, so we turn to distributed indexing. The author also introduces the basics of MapReduce, a general architecture for distributed computing. MapReduce distributes the work across nodes, processing key-value pairs in a "map" phase and a "reduce" phase.
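The map and reduce phases for index construction can be simulated on one machine as below. This is only a sketch of the data flow: the mapper emits (term, docID) pairs from a document split, and the reducer collects all pairs for a term into a postings list. The splits and the single-process "shuffle" are my stand-ins for what a real MapReduce framework does across machines.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit (term, docID) key-value pairs for one document."""
    return [(term, doc_id) for term in text.split()]

def reduce_phase(pairs):
    """Reducer: gather all docIDs for each term into a sorted postings list."""
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {term: sorted(docs) for term, docs in postings.items()}

# Simulate the framework: run mappers over splits, shuffle, then reduce.
splits = {1: "new home sales", 2: "home sales rise"}
intermediate = []
for doc_id, text in splits.items():
    intermediate.extend(map_phase(doc_id, text))
index = reduce_phase(intermediate)
```

In a real deployment the intermediate pairs are partitioned by key (e.g., by term range) so that different reducers build disjoint parts of the index in parallel.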
In dynamic indexing, a good solution is to maintain a large main index together with a small auxiliary index kept in memory; queries consult both, and the auxiliary index is periodically merged into the main one. However, having multiple indexes complicates the maintenance of collection-wide statistics.
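The main-plus-auxiliary scheme can be sketched as follows. The class name and the posting-count threshold are illustrative choices of mine; the book's actual discussion covers more refined strategies such as logarithmic merging.

```python
class DynamicIndex:
    """Toy dynamic index: new postings go into a small in-memory
    auxiliary index; when it exceeds a threshold it is merged
    into the large main index."""
    def __init__(self, threshold):
        self.main, self.aux = {}, {}
        self.threshold = threshold

    def add(self, term, doc_id):
        self.aux.setdefault(term, []).append(doc_id)
        # "Memory full": fold the auxiliary index into the main index.
        if sum(len(p) for p in self.aux.values()) >= self.threshold:
            self._merge()

    def _merge(self):
        for term, postings in self.aux.items():
            merged = self.main.setdefault(term, [])
            merged.extend(postings)
            merged.sort()
        self.aux = {}

    def postings(self, term):
        # A query must consult BOTH indexes and merge the results.
        return sorted(self.main.get(term, []) + self.aux.get(term, []))
```

The `postings` method shows why statistics get complicated: any collection-wide count now has to be reconciled across two (or more) indexes.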
IIR Chapter 5:
In this chapter, the author introduces a number of compression techniques for the dictionary and the inverted index that are essential for efficient IR systems. Compression increases the effectiveness of caching and makes data transfer from disk to memory faster. The author first describes the statistical properties of the entities we want to compress, then presents compression techniques for the dictionary and the inverted index.
At the beginning, the rule of 30 is introduced. The author also notes that all compression techniques presented here are lossless. Heaps' law estimates vocabulary size as a function of collection size: M = kT^b, where M is the number of distinct terms and T is the number of tokens. A commonly used model of the distribution of terms in a collection is Zipf's law.
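Heaps' law is easy to evaluate directly. The parameter values below (k ≈ 44, b ≈ 0.49) are the ones IIR reports for the Reuters-RCV1 collection; for other collections they must be fit from data.

```python
def heaps_vocabulary(T, k=44, b=0.49):
    """Heaps' law M = k * T**b: predicted vocabulary size M
    for a collection of T tokens (k, b are empirical parameters)."""
    return k * T ** b

# For a collection of one million tokens, the predicted vocabulary
# is on the order of tens of thousands of distinct terms.
M = heaps_vocabulary(1_000_000)
```

Because b < 1, the vocabulary keeps growing with collection size but far more slowly than the token count, which is the key fact dictionary compression relies on.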
Storing the dictionary as one long string of characters, with pointers into it instead of fixed-width term entries, saves almost 60% of dictionary space; blocked storage, which keeps a pointer only for every k-th term, saves even more memory.
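A toy version of dictionary-as-a-string with blocked storage, under simplifying assumptions: each term carries a one-character length prefix, and a pointer is kept only for every k-th term. The helper names and the single-character length encoding are mine, chosen for brevity.

```python
def build_blocked_string(terms, k=4):
    """Concatenate sorted terms into one string, each prefixed by its
    length; keep a pointer into the string only every k terms."""
    big_string, pointers, pos = [], [], 0
    for i, term in enumerate(sorted(terms)):
        if i % k == 0:
            pointers.append(pos)        # block boundary: store a pointer
        entry = chr(len(term)) + term   # 1-char length prefix
        big_string.append(entry)
        pos += len(entry)
    return "".join(big_string), pointers

def block_terms(big_string, start, k):
    """Decode the up-to-k terms of the block starting at offset `start`
    by walking the length prefixes."""
    terms, pos = [], start
    for _ in range(k):
        if pos >= len(big_string):
            break
        length = ord(big_string[pos])
        terms.append(big_string[pos + 1: pos + 1 + length])
        pos += 1 + length
    return terms
```

The trade-off is the classic one: larger k means fewer pointers (more space saved) but a longer linear scan inside each block when looking a term up.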
Variable byte (VB) encoding uses an integral number of bytes to encode a gap. γ codes are another approach to compression: they implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
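Both codes can be written in a few lines. In VB encoding, each byte holds 7 payload bits and the high bit marks the last byte of a number; in the γ code, the offset is the binary representation of G with the leading 1 removed, and the length is the unary code for the offset's length. The sketch below follows those definitions; the γ encoder returns a bit string for readability rather than packed bits.

```python
def vb_encode_number(n):
    """Variable byte encoding: 7 payload bits per byte; the high bit
    is set only on the final (least significant) byte."""
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128              # set the continuation bit on the last byte
    return bytes(out)

def vb_decode(data):
    """Decode a VB-encoded byte stream back into a list of numbers."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte              # continuation byte
        else:
            numbers.append(128 * n + (byte - 128))  # final byte
            n = 0
    return numbers

def gamma_encode(n):
    """Gamma code of n >= 1: unary length of the offset, a 0 separator,
    then the offset (binary of n without its leading 1)."""
    offset = bin(n)[3:]         # strip '0b1'
    return "1" * len(offset) + "0" + offset
```

For example, the gap 824 occupies two VB bytes, while small gaps like 5 need only one; γ(13) is 1110101 (length 1110, offset 101), matching the worked examples in the chapter.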