Wednesday, January 8, 2014

Week 2: Reading Notes

IIR Section 1.2, Chapter 2, Chapter 3:
    Firstly, the author introduces a concept named inverted index. At first the documents are divided into tokens. And then regrouped and gathered into a form of inverted index. There are two ways of storage: single linked list and variable length array. By using these two ways, it can make the system more efficient and save time.
    In the part of "The term vocabulary and posting lists", the author reviews some major steps in inverted index construction. Also at the beginning of the chapter, the basic components of documents and how to make sure the sequence character are introduced. Then the article tells tokenization and how to build index to find the most accurate items in the most efficient way. In the last two parts of the second chapter, the author tells something about post lists on how to search a query fast.
    In this chapter, the author imports a concept named "indexing granularity", which may cause some confusion because of the different size of the index. And then he tells that problem can be alleviated by using explicit or implicit search. Moreover, the author compares the differences between token and type. And the most duty of tokenization is to define which tokens are the most accurate ones.The author also introduces a very useful language identification way to sort documents. We can use word segmentation or some ways based on K -gram to identify the language and people's intention. Under some condition, we may use a way called "stop word" to wipe out some words may have no important meaning. Later, the author introduces token normalization and term normalization, which can vague some identical words. Also another way to help normalization is to realize some synonym words such as car vs automobile.Then the author posts two items named "stemming " and "lemmatization", which can simplify the computation complexity. In the end of the chapter, the author intersect  post list with skip pointers.And also the author gives a short brief introduction of positional postings and phrase queries.All of these can shorten the search complexity.
    In the third chapter of IIR, the author tells something related to dictionaries and tolerant retrieval. In the beginning of the chapter, the author develops data structure to help to search. And then a concept named "wildcard " is introduced. Later the author introduces some other forms of queries which may have spelling errors or phonetic vocabulary items.
    Specifically, at first the author tells two broad data structure solutions: search tree and hash coding. In general wildcard queries, the author introduces a kind of index named permuterm index. Moreover, two basic principles of spelling correction are introduced with K- gram. The author focuses on isolated term correction and context sensitive correction as well as two forms in isolated term correction:edit distance and K-gram overlap. Moreover, the author introduces soundex algorithm, which has many different forms.

No comments:

Post a Comment