Thursday, January 30, 2014

Week 4: Muddiest Points

Pre-reading:
    The ways to calculate term weights are quite complex, and I can't understand them well.
Muddiest Points:
    1. What are the weaknesses of the Boolean match model?
    2. I don't quite understand tf-idf weighting.
    3. The calculations in the last few slides are a little difficult for me.

Week 4: Reading Notes

IIR Section 1.3,1.4:
    In this section, the author introduces the inverted index and the basic Boolean retrieval model, and explains the concept of intersecting postings lists.
    Simple, strict Boolean operations cannot satisfy every information need, so a proximity operator is introduced: a way of specifying that two terms in a query must occur close to each other in a document. The author then uses the example of Westlaw to show that Boolean searching is still a good choice in commercial services, since it can be precise and, in expert hands, very powerful.
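    To make the intersection step concrete for myself, here is a small Python sketch of merging two sorted postings lists for a Boolean AND query. It is my own illustration rather than code from the book, and the postings lists are made up.

    def intersect(p1, p2):
        """Merge two sorted postings lists, keeping docIDs present in both (Boolean AND)."""
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return answer

    # Hypothetical postings lists for two query terms
    brutus = [1, 2, 4, 11, 31, 45, 173, 174]
    calpurnia = [2, 31, 54, 101]
    print(intersect(brutus, calpurnia))  # -> [2, 31]

    The merge walks both lists once, so an AND query costs time linear in the total length of the two postings lists.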
IIR Chapter 6:
    This chapter is about scoring, term weighting, and the vector space model. The author first discusses parametric and zone indexes, then develops the idea of weighting the importance of a term in a document. He also covers vector space scoring and how to view documents and queries as vectors, and finally develops several variants of term weighting.
    Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. Zones are similar to fields, except that the contents of a zone can be arbitrary free text. The author also introduces weighted zone scoring and several ways of weighting terms, for example tf-idf weighting, which I still find quite complex.
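    Since tf-idf is the part that confused me, here is a small sketch of the standard formula, weight(t, d) = tf(t, d) * log10(N / df(t)), on a tiny collection I made up; the documents and numbers are only for illustration.

    import math

    # Tiny hypothetical collection of three documents
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    def df(term):
        """Document frequency: number of documents containing the term."""
        return sum(1 for d in tokenized if term in d)

    def tf_idf(term, doc_tokens):
        """tf-idf weight: raw term frequency scaled by inverse document frequency."""
        if df(term) == 0:
            return 0.0
        return doc_tokens.count(term) * math.log10(N / df(term))

    print(tf_idf("cat", tokenized[0]))  # rare term: 1 * log10(3/1), about 0.48
    print(tf_idf("the", tokenized[0]))  # common term: 2 * log10(3/2), about 0.35

    The point I take away is that a term frequent in one document but rare in the collection gets a high weight, while a term that appears everywhere gets a low one.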

Thursday, January 23, 2014

Week 3: Muddiest Points

Pre-reading:
    I don't quite understand the principles of variable byte (VB) encoding and γ codes.
Muddiest Points:
1. I still don't fully understand how edit distance works; I only have an overview of the concept.
2. I think everything the professor mentioned in class was clear enough to understand.

Tuesday, January 14, 2014

Week 3: Reading Notes

IIR Chapter 4:
    In this chapter, the author focuses on index construction, especially construction of the inverted index. Blocked sort-based indexing is introduced, and distributed indexing and dynamic indexing are also covered.
    Hardware is a basic constraint when building an information retrieval system. Operating systems generally read and write entire blocks, and the buffer is the area of main memory where blocks are read and written. Because main memory is limited, we need an external sorting algorithm. Blocked sort-based indexing (BSBI) is an efficient way to build the index, and the author illustrates how BSBI processes the collection.
    However, BSBI does not scale well to very large collections, so a method called single-pass in-memory indexing (SPIMI) is introduced. A key difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list: instead of first collecting all termID–docID pairs and then sorting them (as in BSBI), each postings list is dynamic (its size grows as needed) and is immediately available to collect postings.
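    To check my understanding of SPIMI, here is a rough sketch of the idea in Python. It is my own simplification; the real algorithm also watches memory use and writes each block's sorted dictionary to disk, which I leave out.

    from collections import defaultdict

    def spimi_invert(token_stream):
        """Build one in-memory block: append each posting directly to the
        term's postings list, with no global sort of termID-docID pairs."""
        dictionary = defaultdict(list)           # term -> postings list
        for term, doc_id in token_stream:
            postings = dictionary[term]
            if not postings or postings[-1] != doc_id:   # skip duplicate docIDs
                postings.append(doc_id)
        # terms are sorted only when the block is written out / merged
        return sorted(dictionary.items())

    # Hypothetical stream of (term, docID) pairs
    stream = [("cat", 1), ("dog", 1), ("cat", 2), ("cat", 2), ("dog", 3)]
    for term, postings in spimi_invert(stream):
        print(term, postings)   # cat [1, 2] / dog [1, 3]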
    Moreover, because collections are often very large, a single machine cannot perform index construction efficiently, so we turn to distributed indexing. The author introduces the basics of MapReduce, a general architecture for distributed computing that works well for indexing. MapReduce distributes the work across nodes, which process key-value pairs in a "map" phase and a "reduce" phase.
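    To make the map and reduce phases concrete, here is a toy single-machine imitation of the idea (just to show the data flow, not a real MapReduce framework): the map step emits (term, docID) pairs, and the reduce step groups them into postings lists.

    from collections import defaultdict

    def map_phase(doc_id, text):
        """Mapper: emit a (term, docID) pair for every token in the document."""
        return [(term, doc_id) for term in text.split()]

    def reduce_phase(pairs):
        """Reducer: group the emitted pairs by term into sorted postings lists."""
        grouped = defaultdict(set)
        for term, doc_id in pairs:
            grouped[term].add(doc_id)
        return {term: sorted(ids) for term, ids in grouped.items()}

    # Hypothetical two-document collection
    pairs = map_phase(1, "cat sat on mat") + map_phase(2, "dog sat on log")
    print(reduce_phase(pairs))   # e.g. {'cat': [1], 'sat': [1, 2], 'on': [1, 2], ...}

    In a real cluster the pairs would be partitioned by term (or term range) so that different reduce nodes build different parts of the index.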
    For dynamic indexing, a good solution is to keep a large main index plus a small auxiliary index, stored in memory, for newly added documents. However, having multiple indexes complicates the maintenance of collection-wide statistics.
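    A very rough sketch of that main-plus-auxiliary idea, as I picture it (my own simplification, not the book's pseudocode): new documents go into a small in-memory auxiliary index, queries consult both indexes, and the auxiliary index is merged into the main index when it grows too large.

    class DynamicIndex:
        """Toy main + auxiliary index to illustrate dynamic indexing."""
        def __init__(self, aux_limit=1000):
            self.main = {}        # term -> postings list (imagine this on disk)
            self.aux = {}         # small in-memory index for new documents
            self.aux_limit = aux_limit

        def add(self, doc_id, terms):
            for t in terms:
                self.aux.setdefault(t, []).append(doc_id)
            if sum(len(p) for p in self.aux.values()) > self.aux_limit:
                self.merge()

        def merge(self):
            """Fold the auxiliary index into the main index, then clear it."""
            for t, postings in self.aux.items():
                self.main.setdefault(t, []).extend(postings)
                self.main[t].sort()
            self.aux = {}

        def search(self, term):
            # every query must look at both indexes
            return sorted(self.main.get(term, []) + self.aux.get(term, []))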

IIR Chapter 5:
    In this chapter, the author introduces a number of compression techniques for the dictionary and the inverted index that are essential for efficient IR systems. Compression increases the benefit of caching and makes data transfer from disk to memory faster. The author first describes the statistical properties of the entities we want to compress, and then presents compression techniques for the dictionary and for the inverted index.
    At the beginning, a rule of 30 is introduced, and the author notes that all compression techniques presented here are lossless. He also introduces Heaps' law, which estimates vocabulary size as a function of collection size: M = kT^b. A commonly used model of the distribution of terms in a collection is Zipf's law.
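    As a worked example of Heaps' law, here is the estimate M = k * T^b for a few collection sizes. The parameter values k = 44 and b = 0.49 are the ones I believe the book quotes for the RCV1 collection; I use them here only as an illustration.

    # Heaps' law: vocabulary size M grows as M = k * T**b with T tokens.
    k, b = 44, 0.49   # values reported for RCV1 in IIR, used as an example

    for T in (10_000, 1_000_000, 100_000_000):
        M = k * T ** b
        print(f"T = {T:>11,} tokens  ->  estimated vocabulary M ~ {M:,.0f} terms")

    The takeaway for me is that the vocabulary keeps growing with collection size (there is no fixed upper bound), but much more slowly than the number of tokens.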
    Storing the dictionary as a single string can save almost 60% of the space, and blocked storage saves even more memory.
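    Here is a tiny sketch of the dictionary-as-a-string idea as I understand it: instead of fixed-width term entries, all terms are concatenated into one long string, and each dictionary entry keeps only an offset into that string (the real scheme also stores document frequency and a postings pointer, which I leave out).

    terms = ["automata", "automate", "automatic", "automation"]

    # Concatenate all terms into one long string and remember where each starts.
    term_string = ""
    offsets = []
    for t in terms:
        offsets.append(len(term_string))
        term_string += t

    def lookup(i):
        """Recover the i-th term from the long string using the offsets."""
        end = offsets[i + 1] if i + 1 < len(offsets) else len(term_string)
        return term_string[offsets[i]:end]

    print(term_string)   # automataautomateautomaticautomation
    print(lookup(2))     # automatic

    Blocked storage goes one step further: only every k-th term keeps an offset, and the terms inside a block carry their lengths, which saves most of the pointer space.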
    Variable byte (VB) encoding uses an integral number of bytes to encode a gap. γ codes are another compression technique: they implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
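    Here is a minimal sketch of VB encoding as I understand the book's description: each byte carries 7 bits of the gap, and the high bit is set only on the last byte of each number. The example gaps are just for illustration.

    def vb_encode_number(n):
        """Variable byte encode one gap: 7 payload bits per byte,
        continuation bit set on the final byte of the number."""
        out = []
        while True:
            out.insert(0, n % 128)
            if n < 128:
                break
            n //= 128
        out[-1] += 128            # mark the last byte
        return bytes(out)

    def vb_decode(stream):
        """Decode a stream of VB-encoded gaps back into integers."""
        numbers, n = [], 0
        for b in stream:
            if b < 128:
                n = 128 * n + b
            else:
                numbers.append(128 * n + (b - 128))
                n = 0
        return numbers

    gaps = [824, 5, 214577]       # example gaps between docIDs
    encoded = b"".join(vb_encode_number(g) for g in gaps)
    print(vb_decode(encoded))     # [824, 5, 214577]

    A γ code would instead write each gap as a unary length followed by the binary offset, trading byte alignment for better compression of small gaps.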

Friday, January 10, 2014

Week 2: Muddiest Points

  1. I don't understand some of the content in Chapter 3 of IIR. What is "edit distance" in spelling correction, and how is it calculated? Also, what is a k-gram? Waiting for an answer.
  2. I don't understand the four Indri query language examples.
  3. I don't understand why these two statements are correct:
             1) Stemming never lowers recall (coverage);
             2) Stemming should be used at indexing time, but not at query processing time.

Wednesday, January 8, 2014

Week 2: Reading Notes

IIR Section 1.2, Chapter 2, Chapter 3:
    First, the author introduces the concept of the inverted index. The documents are first divided into tokens, which are then regrouped and gathered into the form of an inverted index. There are two ways to store the postings: singly linked lists and variable-length arrays. Both choices help make the system more efficient and save time.
    In the chapter "The term vocabulary and postings lists", the author reviews the major steps of inverted index construction. At the beginning of the chapter, the basic components of documents and how to determine their character sequence are introduced. The chapter then covers tokenization and how to build an index that finds the most accurate items in the most efficient way. In the last two parts of the chapter, the author discusses postings lists and how to process a query quickly.
    In this chapter, the author introduces the concept of "indexing granularity", which can cause confusion because the indexing unit can be chosen at different sizes; he notes that this problem can be alleviated by explicit or implicit search. The author also compares tokens and types, and explains that the main job of tokenization is to decide which tokens are the right ones. He introduces a useful language-identification approach for sorting documents: we can use word segmentation, or methods based on k-grams, to identify the language and the writer's intent. In some situations we may use a stop list to remove words that carry little meaning. Later, the author introduces token normalization and term normalization, which collapse superficially different but equivalent forms; another aid to normalization is recognizing synonyms such as car vs. automobile. He then presents "stemming" and "lemmatization", which can reduce computational complexity. At the end of the chapter, the author intersects postings lists with skip pointers (see the sketch below) and gives a brief introduction to positional postings and phrase queries. All of these can reduce the cost of searching.
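    Because skip pointers were new to me, here is a small sketch of intersecting two postings lists with evenly spaced skips. The skip interval of roughly the square root of the list length and the example postings are my own choices, following the chapter's general idea.

    import math

    def intersect_with_skips(p1, p2):
        """Intersect two sorted postings lists, using skip pointers placed
        roughly every sqrt(len) positions to jump over non-matching stretches."""
        skip1 = max(1, int(math.sqrt(len(p1))))
        skip2 = max(1, int(math.sqrt(len(p2))))
        answer = []
        i = j = 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                if i + skip1 < len(p1) and p1[i + skip1] <= p2[j]:
                    i += skip1        # safe to skip ahead on p1
                else:
                    i += 1
            else:
                if j + skip2 < len(p2) and p2[j + skip2] <= p1[i]:
                    j += skip2        # safe to skip ahead on p2
                else:
                    j += 1
        return answer

    p1 = [2, 4, 8, 16, 19, 23, 28, 43]
    p2 = [1, 2, 3, 5, 8, 41, 51, 60, 71]
    print(intersect_with_skips(p1, p2))   # [2, 8]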
    The third chapter of IIR deals with dictionaries and tolerant retrieval. At the beginning of the chapter, the author develops data structures that support search, and then introduces the concept of wildcard queries. Later, he covers other forms of queries that may contain spelling errors or phonetically spelled vocabulary items.
    Specifically, the author first describes two broad data structure solutions: search trees and hashing. For general wildcard queries, he introduces the permuterm index. Two basic principles of spelling correction are then introduced, together with k-grams. The author focuses on isolated-term correction and context-sensitive correction, and on two techniques for isolated-term correction: edit distance and k-gram overlap. Finally, he introduces the Soundex algorithm, which exists in many variants.
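    Since edit distance keeps showing up in my muddiest points, here is a minimal sketch of the standard Levenshtein (edit) distance computed with dynamic programming; the word pairs are my own examples.

    def edit_distance(s1, s2):
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn s1 into s2."""
        m, n = len(s1), len(s2)
        # dp[i][j] = edit distance between s1[:i] and s2[:j]
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i
        for j in range(n + 1):
            dp[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s1[i - 1] == s2[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution or match
        return dp[m][n]

    print(edit_distance("cat", "cart"))         # 1
    print(edit_distance("sunday", "saturday"))  # 3

    For spelling correction, the corrected word is typically the vocabulary term with the smallest edit distance to the misspelled query term.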

Week 1: Muddiest Points

1. I forgot what the two pictures stand for:
Unstructured (text) vs. Structured (database) Information in 1996

Unstructured (text) vs. Structured (database) Information in 2009




Week 1: Reading Notes

FOA 1.1:
FOA is a cognitive activity: it concerns meaning rather than the exact words. In the article, the author divides the FOA process into three steps, and uses pictures to introduce the meaning of each step vividly.
Specifically, the three phases are: 1. asking a question; 2. constructing an answer; 3. assessing the answer. In other words, people first raise questions; these questions become search queries and are sent to a search engine. The search engine then draws on its huge document corpus, gathered for the web by a crawler, to construct related answers. In the last step, people give relevance feedback to indicate which results are useful, which are irrelevant, and which are neutral.

 IES 1.1-1.2:
The article gives a brief introduction to information retrieval. It introduces different kinds of search engines (web search, desktop and file system search, enterprise-level IR systems, digital libraries, and other specialized IR systems) and also shows us the components of an IR system.
In the later paragraphs, the author discusses ranking algorithms. He also describes two important criteria for measuring an IR system: efficiency and effectiveness. Only by searching both effectively and efficiently can we get what we want in the shortest time. The author then presents the Probability Ranking Principle (PRP): if the returned results are ranked in order of decreasing probability of relevance, effectiveness is maximized. Using examples and related terminology, he explains how to search efficiently. The author also devotes a short paragraph to the concept of a "document" and how documents are updated; I think this concept will be more useful in later chapters.
As for me, the part of this chapter I find most interesting is web search; it is quite amazing how it works. In the article, the author says that search engines store a "snapshot" of the web in order to produce accurate results and minimize response time. To keep these snapshots current, they use a web crawler to download updates periodically.

MIR 1.1-1.4:
In the first chapter of this book, the author mostly introduces basic information about information retrieval. He first explains the concept of IR, which is about helping people get easy access to the information they are interested in, and then describes the early development of IR: at first the technology was used mainly in libraries, and with the arrival of the World Wide Web, IR finally took center stage.
The author also lists some problems in IR. If people type too many words into a search, it may confuse the search engine; the engine then cannot identify the relevant keywords and fails to return the answers people want. Moreover, the author distinguishes information retrieval from data retrieval: information retrieval tolerates small errors, while data retrieval must be exact.
In the section on the IR system, the author describes the internal construction of an IR system, that is, its software architecture. He uses two figures to illustrate the different levels of an IR system and how the index we want is generated. Through the different layers of processing, we obtain the top-ranked retrieval answers.
Finally, the author introduces the web. He uses the example of Jane Austen to illustrate how the web lets people freely publish their ideas and works. At the end, the author lists five impacts of the web that are changing search. These impacts are two sides of the same coin: they give search engines the chance to develop, but also bring some negative results and disrupt everyday life. Overall, though, they help search to prosper.