Sunday, December 21, 2014

12.21 First day in Los Angeles

Today I traveled across America from east to west. It is so cold in Pittsburgh, while Los Angeles is quite warm.
This morning I got up at 3:30 am, and my roommate Chaoran Zhou drove me and Jiajie Dang to the Pittsburgh airport. I flew to Phoenix first and then took another flight from Phoenix to Los Angeles.
At 12:00 pm local time, I arrived at Los Angeles Airport, where another roommate of mine, Yide Fu, was waiting for me. The four of us were finally together.
The next important decision was where and what to eat. Since I am a fan of Running Man, a Korean TV program, we decided on Korean BBQ. It was a buffet, and we ate from 1:00 pm to 3:00 pm.

After lunch, we went to UCLA, one of the most famous universities in the UC system.

After the visit to UCLA, we met a friend of Yide's, an Indian American. We talked in a bar and then went to a recreation center called Round1, where we bowled. I finished third out of the five of us.

Tuesday, April 15, 2014

Week 14: Reading Notes

1. Information Retrieval on the Semantic Web. In 10th International Conference on Information and Knowledge Management

    One vision of the Semantic Web is that it will be much like the Web we know today, except that documents will be enriched by annotations in machine understandable markup. These annotations will provide metadata about the documents as well as machine-interpretable statements capturing some of the meaning of document content. We discuss how the information retrieval paradigm might be recast in such an environment. We suggest that retrieval can be tightly bound to inference. Doing so makes today’s Web search engines useful to Semantic Web inference engines, and causes improvements in either retrieval or inference to lead directly to improvements in the other.

2. Generalizing from relevance feedback using named entity wildcards

    Traditional adaptive filtering systems learn the user’s interests in a rather simple way – words from relevant documents are favored in the query model, while words from irrelevant documents are down-weighted. This biases the query model towards specific words seen in the past, causing the system to favor documents containing relevant but redundant information over documents that use previously unseen words to denote new facts about the same news event. This paper proposes new ways of generalizing from relevance feedback by augmenting the traditional bag-of-words query model with named entity wildcards that are anchored in context. The use of wildcards allows generalization beyond specific words, while contextual restrictions limit the wildcard-matching to entities related to the user’s query. We test our new approach in a nugget-level adaptive filtering system and evaluate it in terms of both relevance and novelty of the presented information. Our results indicate that higher recall is obtained when lexical terms are generalized using wildcards. However, such wildcards must be anchored to their context to maintain good precision. How the context of a wildcard is represented and matched against a given document also plays a crucial role in the performance of the retrieval system.

3. Learning to rank for information retrieval

    The task of "learning to rank" has emerged as an active and growing area of research both in information retrieval and machine learning. The goal is to design and apply methods to automatically learn a function from training data, such that the function can sort objects (e.g., documents) according to their degrees of relevance, preference, or importance as defined in a specific application. The relevance of this task for IR is without question, because many IR problems are by nature ranking problems. Improved algorithms for learning ranking functions promise improved retrieval quality and less of a need for manual parameter adaptation. In this way, many IR technologies can be potentially enhanced by using learning to rank techniques.
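    The abstract stays high-level about what "learning a function ... to sort objects" means, so here is a tiny sketch of the pairwise approach, entirely my own toy example rather than anything from the paper: a linear scoring function is trained on feature-vector differences so that relevant documents end up scoring above non-relevant ones.

import numpy as np

# Toy training data: each document is a feature vector
# (say, [tf-idf score, link-based score, URL depth]) plus a label.
X = np.array([[0.9, 0.7, 0.1],
              [0.2, 0.1, 0.8],
              [0.8, 0.5, 0.3],
              [0.1, 0.3, 0.9]])
y = np.array([1, 0, 1, 0])  # 1 = relevant, 0 = non-relevant

# Pairwise reduction: for every (relevant, non-relevant) pair, we want
# the weight vector w to satisfy w . (x_rel - x_nonrel) > 0.
pairs = [X[i] - X[j]
         for i in range(len(y)) for j in range(len(y))
         if y[i] == 1 and y[j] == 0]

# A few perceptron-style passes over the mis-ordered pairs.
w = np.zeros(X.shape[1])
for _ in range(100):
    for d in pairs:
        if w @ d <= 0:      # this pair is still ranked the wrong way
            w += 0.1 * d    # nudge w toward the correct ordering

ranking = np.argsort(-(X @ w))   # best document first
print("weights:", w.round(2), "ranking:", ranking)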

Week 13: Muddiest Points

1. How to Perform Classification?
2. What is the Naïve Bayes Classifier?
3. How to compute MI values?

Tuesday, April 8, 2014

Week 13: Reading Notes

IIR Chapter 13:

    We begin this chapter with a general introduction to the text classification problem including a formal definition (Section 13.1); we then cover Naive Bayes, a particularly simple and effective classification method (Sections 13.2–13.4). All of the classification algorithms we study represent documents in high-dimensional spaces. To improve the efficiency of these algorithms, it is generally desirable to reduce the dimensionality of these spaces; to this end, a technique known as feature selection is commonly applied in text classification as discussed in Section 13.5. Section 13.6 covers evaluation of text classification. In the following chapters, Chapters 14 and 15, we look at two other families of classification methods, vector space classifiers and support vector machines.
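    To make the Naive Bayes part concrete, here is a minimal multinomial Naive Bayes sketch with add-one smoothing; the tiny China/Japan training set is invented in the spirit of the chapter's running example.

import math
from collections import Counter, defaultdict

# Invented training set in the spirit of IIR Chapter 13's China example.
train = [("Chinese Beijing Chinese", "china"),
         ("Chinese Chinese Shanghai", "china"),
         ("Chinese Macao", "china"),
         ("Tokyo Japan Chinese", "japan")]

class_docs = Counter(c for _, c in train)          # class priors
term_counts = defaultdict(Counter)                 # per-class term counts
for text, c in train:
    term_counts[c].update(text.lower().split())
vocab = {t for counts in term_counts.values() for t in counts}

def score(doc, c):
    # log P(c) + sum of log P(t|c), with add-one (Laplace) smoothing.
    logp = math.log(class_docs[c] / len(train))
    total = sum(term_counts[c].values())
    for t in doc.lower().split():
        logp += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
    return logp

test = "Chinese Chinese Chinese Tokyo Japan"
print(max(class_docs, key=lambda c: score(test, c)))  # -> china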

IIR Chapter 14:

    There are many classification tasks, in particular the type of text classification that we encountered in Chapter 13, where classes can be distinguished by word patterns. For example, documents in the class China tend to have high values on dimensions like Chinese, Beijing, and Mao, whereas documents in the class UK tend to have high values for London, British and Queen. Documents of the two classes therefore form distinct contiguous regions as shown in Figure 14.1 and we can draw boundaries that separate them and classify new documents. How exactly this is done is the topic of this chapter. Whether or not a set of documents is mapped into a contiguous region depends on the particular choices we make for the document representation: type of weighting, stop list etc. To see that the document representation is crucial, consider the two classes written by a group vs. written by a single person. Frequent occurrence of the first person pronoun I is evidence for the single-person class. But that information is likely deleted from the document representation if we use a stop list. If the document representation chosen is unfavorable, the contiguity hypothesis will not hold and successful vector space classification is not possible.

IIR Chapter 16:
  
    Clustering is the most common form of unsupervised learning. No supervision means that there is no human expert who has assigned documents to classes. In clustering, it is the distribution and makeup of the data that will determine cluster membership. A simple example is Figure 16.1. It is visually clear that there are three distinct clusters of points. This chapter and Chapter 17 introduce algorithms that find such clusters in an unsupervised fashion. The difference between clustering and classification may not seem great at first. After all, in both cases we have a partition of a set of documents into groups. But as we will see the two problems are fundamentally different. Classification is a form of supervised learning (Chapter 13, page 256): our goal is to replicate a categorical distinction that a human supervisor imposes on the data. In unsupervised learning, of which clustering is the most important example, we have no such teacher to guide us. The key input to a clustering algorithm is the distance measure. In Figure 16.1, the distance measure is distance in the 2D plane. This measure suggests three different clusters in the figure. In document clustering, the distance measure is often also Euclidean distance. Different distance measures give rise to different clusterings. Thus, the distance measure is an important means by which we can influence the outcome of clustering.
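    Because the key input is the distance measure, a tiny K-means sketch (the flat clustering algorithm this chapter covers) shows how Euclidean distance alone decides cluster membership; the three 2-D blobs are invented stand-ins for Figure 16.1.

import numpy as np

rng = np.random.default_rng(0)
# Three invented blobs of 2-D points, standing in for Figure 16.1.
points = np.vstack([rng.normal(c, 0.3, (20, 2))
                    for c in ([0, 0], [3, 3], [0, 3])])

K = 3
centroids = points[rng.choice(len(points), K, replace=False)]
for _ in range(10):
    # Assignment: each point joins its nearest centroid (Euclidean).
    dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)
    # Update: each centroid moves to the mean of its members.
    centroids = np.array([points[labels == k].mean(axis=0)
                          if np.any(labels == k) else centroids[k]
                          for k in range(K)])

print(centroids.round(2))   # three centers near (0,0), (3,3), (0,3)
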
IIR Chapter 17:

    This chapter first introduces agglomerative hierarchical clustering (Section 17.1) and presents four different agglomerative algorithms, in Sections 17.2–17.4, which differ in the similarity measures they employ: single-link, complete-link, group-average, and centroid similarity. We then discuss the optimality conditions of hierarchical clustering in Section 17.5. Section 17.6 introduces top-down (or divisive) hierarchical clustering. Section 17.7 looks at labeling clusters automatically, a problem that must be solved whenever humans interact with the output of clustering. We discuss implementation issues in Section 17.8. Section 17.9 provides pointers to further reading, including references to soft hierarchical clustering, which we do not cover in this book. There are few differences between the applications of flat and hierarchical clustering in information retrieval. In particular, hierarchical clustering is appropriate for any of the applications shown in Table 16.1 (page 351; see also Section 16.6, page 372). In fact, the example we gave for collection clustering is hierarchical. In general, we select flat clustering when efficiency is important and hierarchical clustering when one of the potential problems of flat clustering (not enough structure, predetermined number of clusters, non-determinism) is a concern. In addition, many researchers believe that hierarchical clustering produces better clusters than flat clustering. However, there is no consensus on this issue (see references in Section 17.9).
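    As a companion to the four linkage criteria, here is a minimal single-link agglomerative sketch on invented 1-D points; swapping the min for a max in single_link would give complete-link.

points = [1.0, 1.2, 5.0, 5.1, 9.0]   # invented data
clusters = [[p] for p in points]

def single_link(a, b):
    # Single-link: distance between the closest pair of members.
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 2:   # stop at 2 clusters (or run to a full dendrogram)
    # Find and merge the closest pair of clusters.
    i, j = min(((i, j) for i in range(len(clusters))
                for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    clusters[i] += clusters.pop(j)

print(clusters)   # -> [[1.0, 1.2, 5.0, 5.1], [9.0]]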

Week 12: Muddiest Points

1. What is Centroid-Based Classifier?
2. How does an IR system actively select documents for obtaining relevance judgments?
3. What is Personalized Search?

Monday, March 31, 2014

Week 12: Reading Notes

User Profiles for Personalized Information Access:

    The amount of information available online is increasing exponentially. While this information is a valuable resource, its sheer volume limits its value. Many research projects and companies are exploring the use of personalized applications that manage this deluge by tailoring the information presented to individual users. These applications all need to gather, and exploit, some information about individuals in order to be effective. This area is broadly called user profiling. This chapter surveys some of the most popular techniques for collecting information about users, representing, and building user profiles. In particular, explicit information techniques are contrasted with implicitly collected user information using browser caches, proxy servers, browser agents, desktop agents, and search logs. We discuss in detail user profiles represented as weighted keywords, semantic networks, and weighted concepts. We review how each of these profiles is constructed and give examples of projects that employ each of these techniques. Finally, a brief discussion of the importance of privacy protection in profiling is presented.

Content-Based Recommendation Systems:

    This chapter discusses content-based recommendation systems, i.e., systems that recommend an item to a user based upon a description of the item and a profile of the user’s interests. Content-based recommendation systems may be used in a variety of domains ranging from recommending web pages, news articles, restaurants, television programs, and items for sale. Although the details of various systems differ, content-based recommendation systems share in common a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend. The profile is often created and updated automatically in response to feedback on the desirability of items that have been presented to the user.
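    A hypothetical, minimal version of the loop this abstract describes (represent items, build a profile from liked items, compare candidates to the profile) using bag-of-words vectors and cosine similarity, one of several representations the chapter surveys; all data here is invented:

import math
from collections import Counter

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Invented item descriptions and user feedback.
liked = ["spicy korean bbq restaurant", "korean fried chicken place"]
candidates = {"A": "korean bbq buffet downtown",
              "B": "french pastry cafe",
              "C": "spicy sichuan noodle bar"}

# Profile = aggregated description of the items the user liked.
profile = Counter()
for text in liked:
    profile.update(bow(text))

# Recommend candidates in order of similarity to the profile.
for item, text in sorted(candidates.items(),
                         key=lambda kv: -cosine(profile, bow(kv[1]))):
    print(item, round(cosine(profile, bow(text)), 3))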

Personalized web exploration with task models:
    Personalized Web search has emerged as one of the hottest topics for both the Web industry and academic researchers. However, the majority of studies on personalized search focused on a rather simple type of search, which leaves an important research topic - the personalization in exploratory searches - as an under-studied area. In this paper, we present a study of personalization in task-based information exploration using a system called TaskSieve. TaskSieve is a Web search system that utilizes a relevance feedback based profile, called a "task model", for personalization. Its innovations include flexible and user controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models. Through an empirical study using human subjects conducting task-based exploration searches, we demonstrate that TaskSieve pushes significantly more relevant documents to the top of search result lists as compared to a traditional search system. TaskSieve helps users select significantly more accurate information for their tasks, allows the users to do so with higher productivity, and is viewed more favorably by subjects under several usability related characteristics.

Week 11: Muddiest Points

1. What is the difference between parallel corpora and comparable corpora?
2. How to solve out-of-vocabulary term problem?
3. How is parallelism achieved in IR?

Tuesday, March 25, 2014

Week 11: Reading Notes

Cross-Language Information Retrieval. Annual Review of Information Science and Technology

    This chapter reviews research and practice in cross-language information retrieval (CLIR) that seeks to support the process of finding documents written in one natural language (e.g., English or Portuguese) with automated systems that can accept queries expressed in other languages. With the globalization of the economy and the continued internationalization of the Internet, CLIR is becoming an increasingly important capability that facilitates the effective exchange of information. For retrospective retrieval, CLIR allows users to state questions in their native language and then retrieve documents in any supported language. This can simplify searching by multilingual users and, if translation resources are limited, can allow searchers to allocate those resources to the most promising documents. In selective dissemination applications, CLIR allows monolingual users to specify a profile using words from one language and then use that profile to identify promising documents in many languages. Adaptive filtering systems that seek to learn profiles automatically can use CLIR to process training documents that may not be in the same language as the documents that later must be selected.

Cross-Language Information Retrieval. In Ayse Goker, John Davies, Margaret Graham (eds)

    CLIR is available in most major search engines, and it makes it much more convenient for people to retrieve documents in multiple languages. The most common approach is to translate the query into a single language when it contains more than one. I am wondering how the priority language is chosen. For example, if a query contains Japanese and English, should the system translate the Japanese into English or the English into Japanese?
    Moreover, in non-CLIR retrieval we use Boolean queries, index the terms, and look up matches in the index. Why can't we simply build separate indexes for the different languages and, at query time, retrieve the documents that contain all or some of the query terms in both languages? Finally, I am wondering about the market share of CLIR within the US IR market. Since most cross-language queries combine English with one other language, it seems we only need to focus on pairing English with other languages.
    Since the beginning of this semester, I have been wondering about the techniques for retrieving multimedia materials. After reading the materials, I have some basic ideas. Metadata plays a large role in multimedia search, since text, compared to multimedia, is simple and manageable. At the same time, I believe image recognition is very helpful for matching query text against the materials in the library. But because a multimedia index is not as simple as a text index, I am wondering how the index is ordered when it contains pictures or videos.

IES Chapter 14: Parallel Information Retrieval
   
    Information retrieval systems often have to deal with very large amounts of data. They must be able to process many gigabytes or even terabytes of text, and to build and maintain an index for millions of documents. To some extent the techniques discussed in Chapters 5–8 can help us satisfy these requirements, but it is clear that, at some point, sophisticated data structures and clever optimizations alone are not sufficient anymore. A single computer simply does not have the computational power or the storage capabilities required for indexing even a small fraction of the World Wide Web.
    In this chapter we examine various ways of making information retrieval systems scale to very large text collections such as the Web. The first part (Section 14.1) is concerned with parallel query processing, where the search engine’s service rate is increased by having multiple index servers process incoming queries in parallel. It also discusses redundancy and fault tolerance issues in distributed search engines. In the second part (Section 14.2), we shift our attention to the parallel execution of off-line tasks, such as index construction and statistical analysis of a corpus of text. We explain the basics of MapReduce, a framework designed for massively parallel computations carried out on large amounts of data.
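    To make the MapReduce dataflow concrete, here is a single-process simulation of its map/shuffle/reduce phases building a term-to-postings index, the classic index-construction use the chapter describes (toy documents invented; a real deployment distributes each phase across machines).

from collections import defaultdict

# Invented toy collection: docID -> text.
docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "new motors hit the market"}

# Map phase: emit (term, docID) pairs for every token.
def mapper(doc_id, text):
    for term in text.split():
        yield term, doc_id

# Shuffle phase: group the intermediate pairs by key.
grouped = defaultdict(list)
for doc_id, text in docs.items():
    for term, value in mapper(doc_id, text):
        grouped[term].append(value)

# Reduce phase: turn each (term, [docIDs]) group into a postings list.
def reducer(term, doc_ids):
    return term, sorted(set(doc_ids))

index = dict(reducer(t, ids) for t, ids in grouped.items())
print(index["sales"])   # -> [1, 2]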

Week 10: Muddiest Points

1. Why are web query logs so important?
2. What is the main function of a crawler?
3. What is "The HITS algorithm: Grab pages"?

Monday, March 3, 2014

Week 10: Reading Notes

IIR Chapter 19:

    In this and the following two chapters, we consider web search engines. Sections 19.1–19.4 provide some background and history to help the reader appreciate the forces that conspire to make the Web chaotic, fast-changing and (from the standpoint of information retrieval) very different from the “traditional” collections studied thus far in this book. Sections 19.5–19.6 deal with estimating the number of documents indexed by web search engines, and the elimination of duplicate documents in web indexes, respectively. These two latter sections serve as background material for the following two chapters.
    We can view the static Web consisting of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge. Early in the history of web search, it became clear that web search engines were an important means for connecting advertisers to prospective buyers.
    One aspect we have ignored in the discussion of index size in Section 19.5 is duplication: the Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Many of these are legitimate copies; for instance, certain information repositories are mirrored simply to provide redundancy and access reliability. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads.
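    Section 19.6's standard tool for this is shingling; here is a minimal sketch of word-level shingles compared with the Jaccard coefficient (production systems estimate this with hashed sketches instead of exact set comparison).

def shingles(text, k=3):
    # The set of k-word shingles of a document.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

d1 = "a rose is a rose is a rose"
d2 = "a rose is a rose is a flower"
print(round(jaccard(shingles(d1), shingles(d2)), 2))   # -> 0.75

# Pages whose shingle sets overlap almost completely (Jaccard near 1)
# are treated as near-duplicates and indexed only once.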

IIR Chapter 21:

    The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query. We begin by reviewing some basics of the Web as a graph in Section 21.1, then proceed to the technical development of the elements of link analysis for ranking. Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter’s tally of in-links. This phenomenon is referred to as link spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough that it is feasible for web search engines to derive useful signals for ranking from more sophisticated link analysis. Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues of Chapter 20. Section 21.1 develops the basic ideas underlying the use of the web graph in link analysis. Sections 21.2 and 21.3 then develop two distinct methods for link analysis, PageRank and HITS.
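    Here is a short power-iteration sketch of PageRank, one of the two methods named above, on an invented four-page graph with the usual teleport (damping) factor:

import numpy as np

# Invented web graph: links[i] = pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = len(links)

# Transition matrix M: M[j, i] = 1/outdegree(i) whenever i links to j.
M = np.zeros((N, N))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

d = 0.85                   # follow a link with probability d ...
r = np.full(N, 1.0 / N)    # ... otherwise teleport to a random page
for _ in range(100):       # power iteration toward the stationary vector
    r = (1 - d) / N + d * M @ r

print(r.round(3))          # page 2, with three in-links, scores highest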

Authoritative sources in a hyperlinked environment:

    The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
    The further development of link-based methods to handle information needs other than broad-topic queries on the www poses many interesting challenges. As noted above, work has been done on the incorporation of textual content into our framework as a way of "focusing" a broad-topic search [6, 10, 11], but one can ask what other basic informational structures one can identify, beyond hubs and authorities, from the link topology of hypermedia such as the www. The means by which interaction with a link structure can facilitate the discovery of information is a general and far-reaching notion, and we feel that it will continue to offer a range of fascinating algorithmic possibilities.
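    A matching sketch of the paper's hub/authority iteration on an invented graph, normalizing the score vectors each round as the paper's procedure does:

import numpy as np

# Invented link graph as an adjacency matrix: A[i, j] = 1 if i links to j.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])

hubs = np.ones(len(A))
auths = np.ones(len(A))
for _ in range(50):
    # A good authority is pointed to by good hubs;
    # a good hub points to good authorities.
    auths = A.T @ hubs
    hubs = A @ auths
    auths /= np.linalg.norm(auths)
    hubs /= np.linalg.norm(hubs)

print("authorities:", auths.round(3))
print("hubs:       ", hubs.round(3))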

The Anatomy of a Large-Scale Hypertextual Web Search Engine:

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Week 8: Muddiest Points

Tuesday, February 25, 2014

Week 8: Reading Notes

Marti A. Hearst Chapter 1: 
(http://searchuserinterfaces.com/book/sui_ch1_design.html)
    This chapter has introduced the ideas and practices surrounding user interface design in general, and search interface design in particular. It has explained some of the difficulties with search interface design and provided a set of design guidelines tailored specifically to search user interfaces. These guidelines include:
  • Offer efficient and informative feedback,
  • Balance user control with automated actions,
  • Reduce short-term memory load,
  • Provide shortcuts,
  • Reduce errors,
  • Recognize the importance of small details, and
  • Recognize the importance of aesthetics.
    This chapter has also summarized some of the most successful design ideas that are commonly in use in search interfaces today. This summary is based on generalizing over the results of years of research, experimentation, and tests in the marketplace. The coming years should reveal additional new, exciting ideas that will become reliable standards for search user interfaces.

Marti A. Hearst Chapter 11: 
    Visualization is a promising tool for the analysis and understanding of text collections, including semi-structured text as found in citation collections, and for applications such as literary analysis. Although not shown in this chapter, visualization has also been applied to online conversations and other forms of social interaction which have textual components. With the advent of social visualization Web sites like IBM's manyeyes.com, and other tools that continue to make visualization generally accessible to users who are not programmers, it is likely that the use of visualization for analysis of text will only continue to grow in popularity.

IIR Chapter 10:
    Information retrieval systems are often contrasted with relational databases. Traditionally, IR systems have retrieved information from unstructured text – by which we mean “raw” text without markup. Databases are designed for querying relational data.
    After presenting the basic concepts of XML in Section 10.1, this chapter first discusses the challenges we face in XML retrieval (Section 10.2). Next we describe a vector space model for XML retrieval (Section 10.3). Section 10.4 presents INEX, a shared task evaluation that has been held for a number of years and currently is the most important venue for XML retrieval research. We discuss the differences between data-centric and text-centric approaches to XML in Section 10.5.
    An XML document is an ordered, labeled tree. Each node of the tree is an XML element and is written with an opening and closing tag. An element can have one or more XML attributes.
    The premier venue for research on XML retrieval is the INEX (INitiative for the Evaluation of XML retrieval) program, a collaborative effort that has produced reference collections, sets of queries, and relevance judgments.

Monday, February 24, 2014

Week 7: Muddiest Points

1. Why should we have the concept of NDCG? I think DCG and IDCG are enough.
2. If two systems have the same average significance but different standard deviations, which one is better: the one with the bigger standard deviation or the smaller one?
3. How to make RP useful to the user?

Wednesday, February 19, 2014

Week 7: Reading Notes

IIR Chapter 9:
    In this chapter the author mainly talks about relevance feedback and query expansion. The same concept may be referred to using different words. This issue, known as synonymy, has an impact on the recall of most information retrieval systems. The chapter discusses ways in which a system can help with query refinement, either fully automatically or with the user in the loop.
    The methods for tackling this problem split into two major classes: global methods and local methods. Global methods are techniques for expanding or reformulating query terms independent of the query and results returned from it, so that changes in the query wording will cause the new query to match other semantically similar terms.
    The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set. The Rocchio Algorithm is the classic algorithm for implementing relevance feedback. It models a way of incorporating relevance feedback information into the vector space model.
    In the next section we more briefly discuss three global methods for expanding a query: by simply aiding the user in doing so, by using a manual thesaurus, and through building a thesaurus automatically.
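    Since the Rocchio algorithm is the centerpiece of the relevance feedback discussion above, here is a minimal sketch of its update step; the term-space vectors are invented, and the weights alpha, beta, gamma are the common textbook defaults.

import numpy as np

# Invented term-space vectors: the original query and judged documents.
q0 = np.array([1.0, 0.0, 0.5, 0.0])
relevant = [np.array([0.9, 0.1, 0.8, 0.0]),
            np.array([0.7, 0.0, 0.9, 0.1])]
nonrelevant = [np.array([0.0, 0.9, 0.1, 0.8])]

alpha, beta, gamma = 1.0, 0.75, 0.15   # common default weights

# Rocchio: move the query toward the centroid of the relevant documents
# and away from the centroid of the non-relevant ones.
q1 = (alpha * q0
      + beta * np.mean(relevant, axis=0)
      - gamma * np.mean(nonrelevant, axis=0))
q1 = np.maximum(q1, 0)   # negative term weights are usually clipped to 0

print(q1.round(3))
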
Improving the effectiveness of information retrieval with local context analysis
    Techniques for automatic query expansion have been extensively studied in information retrieval as a means of addressing the word mismatch between queries and documents. These techniques can be categorized as either global or local. While global techniques rely on analysis of a whole collection to discover word relationships, local techniques emphasize analysis of the top-ranked documents retrieved for a query. While local techniques have been shown to be more effective than global techniques in general, existing local techniques are not robust and can seriously hurt retrieval when few of the retrieved documents are relevant. We propose a new technique, called local context analysis, which selects expansion terms based on co-occurrence with the query terms within the top-ranked documents. Experiments on a number of collections, both English and non-English, show that local context analysis offers more effective and consistent retrieval results.
A study of methods for negative relevance feedback.
    Negative relevance feedback is a special case of relevance feedback where we do not have any positive example; this often happens when the topic is difficult and the search results are poor. Although in principle any standard relevance feedback technique can be applied to negative relevance feedback, it may not perform well due to the lack of positive examples. In this paper, we conduct a systematic study of methods for negative relevance feedback. We compare a set of representative negative feedback methods, covering vector-space models and language models, as well as several special heuristics for negative feedback. Evaluating negative feedback methods requires a test set with sufficient difficult topics, but there are not many naturally difficult topics in the existing test collections. We use two sampling strategies to adapt a test collection with easy topics to evaluate negative feedback. Experiment results on several TREC collections show that language model based negative feedback methods are generally more effective than those based on vector-space models, and using multiple negative models is an effective heuristic for negative feedback. Our results also show that it is feasible to adapt test collections with easy topics for evaluating negative feedback methods through sampling.
Relevance feedback revisited  
    Researchers have found relevance feedback to be effective in interactive information retrieval, although few formal user experiments have been made. In order to run a user experiment on a large document collection, experiments were performed at NIST to complete some of the missing links found in using the probabilistic retrieval model. These experiments, using the Cranfield 1400 collection, showed the importance of query expansion in addition to query reweighting, and showed that adding as few as 20 well-selected terms could result in performance improvements of over 100%. Additionally it was shown that performing multiple iterations of feedback is highly effective.

Week 6: Muddiest Points

1. How does pooling work?
2. I don't quite understand the part about Kappa.
3. For which type of searches is MRR a good measure?

Thursday, February 13, 2014

Week 6: Reading Notes

IIR Chapter 8:
    In this chapter, the author mainly talks about evaluation in information retrieval. Information retrieval has developed as a highly empirical discipline, requiring careful and thorough evaluation to demonstrate the superior performance of novel techniques on representative document collections.
    The author begins with a discussion of measuring the effectiveness of IR systems and the test collections that are most often used for this purpose. He then presents the straightforward notion of relevant and nonrelevant documents and the formal evaluation methodology that has been developed for evaluating unranked retrieval results. He then extends these notions and develops further measures for evaluating ranked retrieval results. Stepping back, he introduces the notion of user utility and how it is approximated by the use of document relevance. The author also points out a common misunderstanding: user perceptions do not always coincide with system designers' notions of quality.
    At first the author introduces the concept of a test collection, which contains three different parts. He also notes that relevance is assessed relative to an information need, not a query. In the next section, the author lists some standard test collections, including the Cranfield collection, the Text Retrieval Conference (TREC), the NII Test Collections for IR Systems (NTCIR), GOV2, CLEF, etc.
    In the next section, the author introduces a concept named the contingency table and draws a table to illustrate it. He lists the equation accuracy = (tp + tn)/(tp + fp + fn + tn). However, the author says this measure is not that useful, because it is dominated by the huge number of true negatives on a typical collection. He then advocates using both precision and recall, since the advantage of having the two numbers is that one is often more important than the other in a given circumstance. In the final analysis, the success of an IR system depends on how good it is at satisfying the needs of these idiosyncratic humans.
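    A worked version of these measures from the contingency table (counts invented):

# Invented contingency-table counts for one query on a 1,000-doc collection.
tp, fp, fn, tn = 18, 2, 12, 968    # 20 retrieved, 30 relevant

accuracy  = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy looks superb (0.986) only because tn dominates -- exactly the
# weakness noted above; precision (0.9) and recall (0.6) are more telling.
print(accuracy, precision, recall, round(f1, 2))   # F1 = 0.72
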
    In the last few sections, the author evaluates from a broader perspective. He focuses on user utility, refining a deployed system, and system issues.
What's the value of TREC: is there a gap to jump or a chasm to bridge?
    The TREC Programme has been very successful at generalising: it has shown the effectiveness of essentially simple methods of retrieving documents. However, the TREC Programme has sought to address variation in a largely ad hoc and unsystematic way.
    The author's case is based on the notion of micro variation, and on the distinction between system environment and task context. The author uses the evaluation framework ideas to analyse the TREC experimental programme and to support the argument for a new direction for TREC.
    A convenient way of summarising the Cranfield evaluation paradigm is in terms of environment variables and system parameters.
    In general, TREC participants have sought to adapt, or extend, their existing system apparatus to the new environment variable values. The foregoing is only an informal discussion: a more thorough analysis of retrieval contexts is needed for a factor characterisation to be taken seriously as a basis for system development.
Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems:
    Modern large retrieval environments tend to overwhelm their users by their large output. In order to develop IR techniques in this direction, it is necessary to develop evaluation approaches and methods that credit IR methods for their ability to retrieve highly relevant documents.
    Graded relevance judgments may be used for IR evaluation, first, by extending traditional evaluation measures, such as recall and precision and P–R curves, to use them.
    The author demonstrates the use of the proposed measures in a case study testing runs from the TREC-7 ad hoc track with binary and nonbinary relevance judgments.
    In modern large database environments, the development and evaluation of IR methods should be based on their ability to retrieve highly relevant documents. This is often desirable from the user viewpoint and presents a not too liberal test for IR techniques.
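    A short sketch of the cumulated-gain measures the paper proposes, on an invented list of graded relevance judgments; normalizing DCG by the ideal DCG is what makes scores comparable across queries (which also answers my earlier NDCG muddiest point).

import math

def dcg(gains):
    # Gain discounted by log2 of the rank; ranks 1 and 2 are undiscounted
    # in this common formulation, hence log2(max(i, 2)).
    return sum(g / math.log2(max(i, 2)) for i, g in enumerate(gains, 1))

ranked_gains = [3, 2, 3, 0, 1, 2]           # graded relevance of the results
ideal = sorted(ranked_gains, reverse=True)  # best possible ordering

ndcg = dcg(ranked_gains) / dcg(ideal)
print(round(dcg(ranked_gains), 3), round(ndcg, 3))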

Week 5: Muddiest Points

1. Why is the unigram language model more popular than the model comparison approach?
2. As for higher-order models, which one is best among n-gram, cache, and grammar models?
3. I still don't quite understand the use of smoothing.
4. The language model is based on the vector space model, right?

Wednesday, February 5, 2014

Week 5: Reading Notes

IIR Chapter 11:   
    In this chapter, the author mainly talks about using probabilities in information retrieval, and he introduces several approaches to probabilistic information retrieval. Users start with information needs, which they translate into query representations; documents, in turn, are converted into document representations.
    In this chapter, the author first introduces some basic knowledge of probability, most of which we already learned in high school. He then concentrates on the Binary Independence Model, which is the original and still most influential probabilistic retrieval model. Finally, he introduces related but extended methods that use term counts, including the empirically successful Okapi BM25 weighting scheme and Bayesian network models for IR. These are all probabilistic methods, with some mathematical derivations and explanations. The author introduces the partition rule and Bayes' Rule, which states that P(A|B) = P(B|A)P(A)/P(B). At last, the author mentions the concept of odds: the odds of an event provide a kind of multiplier for how probabilities change.
    In another section, the author introduces the 1/0 loss case and the quantity P(R = 1|d, q), which is the basis of the Probability Ranking Principle. If a set of retrieval results is to be returned, rather than an ordering, the Bayes Optimal Decision Rule, the decision which minimizes the risk of loss, is to simply return documents that are more likely relevant than nonrelevant. The Probability Ranking Principle says that if P(R = 1|d, q) is at least as large as P(R = 1|d', q) for a specific document d and for all documents d' not yet retrieved, then d is the next document to be retrieved.
    The next section is about the Binary Independence Model. The author makes the conditional independence assumption that the presence or absence of a word in a document is independent of the presence or absence of any other word. The resulting quantity used for ranking is called the Retrieval Status Value (RSV) in this model: RSV_d = sum over the query terms present in d of log [p_t(1 - u_t) / (u_t(1 - p_t))], where p_t and u_t are the probabilities of the term appearing in relevant and nonrelevant documents respectively.
    In the next part, the author introduces probability estimates in theory. To avoid the possibility of zeroes (such as if every or no relevant document has a particular term), it is fairly standard to add 1/2 to each of the quantities. Estimating a probability as the relative frequency of the event is the maximum likelihood estimate (MLE), because this value makes the observed data maximally likely. The author then introduces the concept of a Bayesian prior, which gives a form of maximum a posteriori (MAP) estimation, where we choose the most likely point value for probabilities based on the prior and the observed evidence.
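    A small sketch of the add-1/2 estimate in action, computing the per-term weight c_t (the log odds ratio that each matching term contributes to the RSV) from invented counts:

import math

# Invented counts for one query term t.
N = 1000      # documents in the collection
df_t = 100    # documents containing t
R = 20        # known relevant documents
r_t = 15      # relevant documents containing t

# Add-1/2 smoothed estimates of p_t = P(t present | relevant)
# and u_t = P(t present | nonrelevant).
p_t = (r_t + 0.5) / (R + 1)
u_t = (df_t - r_t + 0.5) / (N - R + 1)

# Per-term contribution c_t = log [p_t(1 - u_t) / (u_t(1 - p_t))].
c_t = math.log((p_t * (1 - u_t)) / (u_t * (1 - p_t)))
print(round(c_t, 3))   # large positive weight: t is evidence of relevance
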
    In the part on probability estimates in practice, the author notes that Croft and Harper (1979) proposed using a constant in their combination match model. Moreover, we can use (pseudo-)relevance feedback, perhaps in an iterative process of estimation, to get a more accurate estimate of p_t.
    Probabilistic methods are among the oldest formal models in IR. With tree-structured dependencies between terms, some of the assumptions of the BIM can be removed.

IIR Chapter 12:
    This chapter is mainly about language models for information retrieval. Instead of overtly modeling the probability P(R = 1|q, d) discussed in Chapter 11, the basic language modeling approach instead builds a probabilistic language model Md from each document d, and ranks documents based on the probability of the model generating the query: P(q|Md). In the first part of the chapter, the author introduces the concept of language models and then discusses the query likelihood model. In the end, the author also covers various extensions to the language modeling approach.
    At the beginning of this chapter, the author introduces finite automata and language models. A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. A language model is a function that puts a probability measure over strings drawn from some vocabulary. As for the types of language models, the simplest form simply throws away all conditioning context and estimates each term independently; this is called the unigram language model. At the end of the section, the professor tells us: "The strategy we adopt in IR is as follows. We pretend that the document d is only a representative sample of text drawn from a model distribution, treating it like a fine-grained topic. We then estimate a language model from this sample, and use that model to calculate the probability of observing any word sequence, and, finally, we rank documents according to their probability of generating the query." That is the strategy of language models in IR.
    In the next several sections, the author presents many kinds of language models. Language modeling is a quite general formal approach to IR, with many variant realizations. The original and basic method for using language models in IR is the query likelihood model. The most common way to do this is using the multinomial unigram language model, which is equivalent to a multinomial Naive Bayes model (page 263), where the documents are the classes, each treated in the estimation as a separate "language".
    In the end, the author concludes that the retrieval ranking for a query q under the basic LM for IR he has been considering is given by P(d|q) ∝ P(d) · Π_{t in q} ((1 − λ)P(t|Mc) + λP(t|Md)), mixing each document's model Md with the collection model Mc for smoothing.
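    A minimal query-likelihood sketch with this linear (Jelinek-Mercer) smoothing, on invented toy documents and λ = 0.5:

import math
from collections import Counter

docs = {"d1": "xyzzy begins with an x",
        "d2": "click here for shears info",
        "d3": "information retrieval with language models"}
tokenized = {d: t.lower().split() for d, t in docs.items()}

# Collection model Mc: pooled term frequencies over all documents.
collection = Counter(t for toks in tokenized.values() for t in toks)
c_len = sum(collection.values())
lam = 0.5   # mixing weight between document model and collection model

def log_p_q_given_d(query, doc_id):
    toks = tokenized[doc_id]
    tf = Counter(toks)
    logp = 0.0
    for t in query.lower().split():
        p_md = tf[t] / len(toks)        # P(t | Md)
        p_mc = collection[t] / c_len    # P(t | Mc)
        logp += math.log(lam * p_md + (1 - lam) * p_mc)
    return logp

query = "information retrieval"
print(sorted(tokenized, key=lambda d: -log_p_q_given_d(query, d)))  # d3 first
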


    The next section is about the comparison between language modeling and other approaches in IR. Compared to other probabilistic approaches, such as the BIM from Chapter 11, the main difference initially appears to be that the LM approach does away with explicitly modeling relevance (whereas this is the central variable evaluated in the BIM approach). The model has significant relations to traditional tf-idf models. Also, the professor lists three ways of developing the language modeling approach: query likelihood, document likelihood, and model comparison.
The Paper:
    This paper mainly talks about the comparison between the traditional IR models and the new language model in IR. There are three traditional IR models: the Boolean model, the vector model, and the probabilistic model. The new language model builds on the vector model's tf-idf terms and on the probabilistic model's relevance weighting. The vector model and the probabilistic model stand for different approaches to information retrieval: the former is based on the similarity between query and document, the latter on the probability of relevance, using the distribution of terms over relevant and non-relevant documents. However, the author finds some interesting things in language models, presents a strong theoretical motivation of the language modelling approach, and shows that the approach outperforms the weighting algorithms developed within traditional models.
    After discussing the traditional models' features, the author introduces the statistical language model of retrieval. He uses an urn model as a metaphor to illustrate language models, and then introduces the ad-hoc retrieval task. He uses the traditional models as the foundation and gives a definition of the corresponding probability measures, along with parameter estimates based on tf and idf. In the end, he shows the results on the ad hoc task: both the original probabilistic model and the original vector space model underperform, and the language model shares some features with the traditional models. The paper closes by introducing new ways of thinking about two popular information retrieval tools: the use of stop words and the use of a stemmer.

Thursday, January 30, 2014

Week 4: Muddiest Points

Pre- reading:
    The ways of calculating weights are quite complex, and I can't understand them well.
Muddiest Points:
    1. What is the weakness of the Boolean match model?
    2. I don't quite understand tf-idf weighting.
    3. The calculations in the last few slides are a little difficult for me.

Week 4: Reading Notes

IIR Section 1.3,1.4:
    In this section, the author introduces how to use an inverted index and the basic Boolean retrieval model. He explains the concept of intersection.
    Simple strict Boolean operations cannot satisfy all information needs, so a proximity operator is introduced. It is a way of specifying that two terms in a query must occur close to each other in a document. The author then uses the example of Westlaw to illustrate that Boolean searching is still widely used in commercial services. Boolean searching is more accurate and can be more powerful.
IIR Chapter 6:
    This chapter is about scoring, term weighting, and the vector space model. The author talks about parametric and zone indexes at first. Then he develops the idea of weighting the importance of a term in a document. The author also presents vector space scoring and how to compute scores by viewing queries and documents as vectors. At last, he develops several variants of term weighting.
    Digital documents generally encode, in machine-recognizable form, certain metadata associated with each document. Zones are similar to fields, except the contents of a zone can be arbitrary free text. The author also introduces a concept named weighted zone scoring, as well as some term-weighting schemes, for example tf-idf weighting, which is quite complex.
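    Since tf-idf felt complex, here is a compact sketch of the standard scheme from IIR 6 (log-scaled term frequency times inverse document frequency) on an invented toy collection:

import math
from collections import Counter

docs = ["the car is driven on the road",
        "the truck is driven on the highway",
        "a car and the truck"]
N = len(docs)
tokenized = [d.split() for d in docs]

# df_t: the number of documents containing term t.
df = Counter(t for toks in tokenized for t in set(toks))

def tf_idf(term, doc_index):
    tf = tokenized[doc_index].count(term)
    if tf == 0:
        return 0.0
    # Log-scaled term frequency times inverse document frequency.
    return (1 + math.log10(tf)) * math.log10(N / df[term])

for term in ["car", "the", "highway"]:
    print(term, round(tf_idf(term, 0), 3))
# "the" scores 0: it occurs in every document, so idf = log(3/3) = 0.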

Thursday, January 23, 2014

Week 3: Muddiest Points

Pre-reading:
    I don't understand the principle of Variable byte (VB) encoding and γ codes quite well. 
Muddiest Points:
1. I still don't quite understand how edit distance works; I only have an overview of the concept.
2. I think everything else the professor mentioned in class was clear enough to understand.

Tuesday, January 14, 2014

Week 3: Reading Notes

IIR Chapter 4:
    In this chapter, the author focuses on index construction, especially for the inverted index. A blocked sort-based indexing method is introduced, and distributed indexing and dynamic indexing are also covered in this chapter.
    Hardware is a basic consideration when building an information retrieval system. Operating systems generally read and write entire blocks, and a buffer is the part of main memory where blocks are read and written. Because main memory is limited, we need to use an external sorting algorithm. Blocked sort-based indexing (BSBI) is an efficient way to speed up index construction, and the author illustrates how BSBI processes information.
    However, BSBI is not suitable for very large collections, so a method called single-pass in-memory indexing (SPIMI) is introduced. A difference between BSBI and SPIMI is that SPIMI adds a posting directly to its postings list. Instead of first collecting all termID-docID pairs and then sorting them (as we did in BSBI), each postings list is dynamic (i.e., its size is adjusted as it grows) and it is immediately available to collect postings.
    Moreover, because collections are often so large that a single machine cannot perform index construction efficiently, we turn to distributed indexing. The author also introduces the basic parts of MapReduce, a general architecture for distributed indexing. MapReduce spreads indexing work across nodes as key-value pairs through "map" and "reduce" phases.
    In dynamic indexing, a good solution is to have a large main index and a small auxiliary index stored in memory. But having multiple indexes complicates the maintenance of collection-wide statistics.
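    A miniature in-memory version of what BSBI does block by block: collect (term, docID) pairs, sort them, and group them into postings lists (toy documents invented; a real BSBI run writes sorted blocks to disk and merges them).

from itertools import groupby

docs = {1: "new home sales top forecasts",
        2: "home sales rise in july",
        3: "increase in home sales in july"}

# Collect term-docID pairs (BSBI would do this one block at a time).
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.split()]

# Sort by term, then docID -- the external-sort step, done in memory here.
pairs.sort()

# Group the sorted pairs into postings lists.
index = {term: sorted({doc_id for _, doc_id in group})
         for term, group in groupby(pairs, key=lambda p: p[0])}

print(index["sales"])   # -> [1, 2, 3]
print(index["july"])    # -> [2, 3]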

IIR Chapter 5:
    In this chapter, the author introduces a number of compression techniques for the dictionary and the inverted index that are essential for efficient IR systems. Compression increases the effectiveness of caching and makes data transfer from disk to memory faster. At first, the author characterizes the distribution of the terms we want to compress; then he presents compression techniques for the dictionary and the inverted index.
    At the beginning, the rule of 30 is introduced. In addition, the author notes that all compression techniques introduced here are lossless. The author also introduces Heaps' law, which estimates vocabulary size as a function of collection size: M = kT^b. Moreover, a commonly used model of the distribution of terms in a collection is Zipf's law.
    Storing the dictionary as a single string can save almost 60% of its space, and blocked storage saves further space in memory.
    Variable byte (VB) encoding uses an integral number of bytes to encode a gap. γ codes are another approach to compression: they implement variable-length encoding by splitting the representation of a gap G into a pair of length and offset.
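    Since VB encoding was a muddiest point for me, here is a small sketch of the scheme as IIR 5 describes it: each gap is written seven bits per byte, and the high bit marks the last byte of a number (the postings example 824, 5, 214577 is the book's).

def vb_encode(n):
    # Variable-byte encode one gap: 7 payload bits per byte,
    # with 128 added to the final (terminating) byte.
    out = []
    while True:
        out.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    out[-1] += 128
    return bytes(out)

def vb_decode(data):
    nums, n = [], 0
    for b in data:
        if b < 128:                     # continuation byte
            n = n * 128 + b
        else:                           # final byte of this number
            nums.append(n * 128 + (b - 128))
            n = 0
    return nums

gaps = [824, 5, 214577]                 # gaps between docIDs
encoded = b"".join(vb_encode(g) for g in gaps)
print(len(encoded), vb_decode(encoded)) # 6 bytes -> [824, 5, 214577]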

Friday, January 10, 2014

Week 2: Muddiest Points

  1. I don't understand some of the content in Chapter 3 of IIR. What is "edit distance" in spelling correction, and how is it calculated? What's more, what is a k-gram? Waiting for an answer.
  2. I don't understand the four Indri examples of query language
  3. I don't understand why these two sentences are right:
             1)Stemming never lowers recall(coverage);
             2)Stemming should be used at indexing, but not at query processing.

Wednesday, January 8, 2014

Week 2: Reading Notes

IIR Section 1.2, Chapter 2, Chapter 3:
    Firstly, the author introduces a concept named the inverted index. The documents are first divided into tokens, which are then regrouped and gathered into the form of an inverted index. There are two storage options for postings: singly linked lists and variable-length arrays. Both make the system more efficient and save time.
    In the part on "The term vocabulary and postings lists", the author reviews the major steps of inverted index construction. At the beginning of the chapter, the basic components of documents and how to determine the correct character sequence are introduced. The chapter then covers tokenization and how to build an index that finds the most accurate items in the most efficient way. In the last two parts of the second chapter, the author discusses postings lists and how to process a query fast.
    In this chapter, the author introduces a concept named "indexing granularity", which may cause some confusion because of the different possible sizes of the indexing unit; he notes the problem can be alleviated by using explicit or implicit search. Moreover, the author compares the differences between a token and a type, and explains that the main duty of tokenization is to define which tokens are the most accurate ones. The author also introduces a very useful language identification approach for sorting documents: we can use word segmentation or k-gram-based methods to identify the language and people's intention. Under some conditions, we may use "stop words" to remove words that carry little meaning. Later, the author introduces token normalization and term normalization, which can conflate nearly identical words; another aid to normalization is recognizing synonyms such as car vs. automobile. Then the author presents two techniques named "stemming" and "lemmatization", which can reduce computational complexity. At the end of the chapter, the author intersects postings lists with skip pointers, and gives a brief introduction to positional postings and phrase queries. All of these can reduce search complexity.
    In the third chapter of IIR, the author discusses dictionaries and tolerant retrieval. At the beginning of the chapter, the author develops data structures that support search, and then the concept of "wildcard" queries is introduced. Later, the author covers other forms of queries that may contain spelling errors or phonetic vocabulary items.
    Specifically, the author first describes two broad data structure solutions: search trees and hashing. For general wildcard queries, the author introduces a kind of index named the permuterm index. Moreover, two basic principles of spelling correction are introduced along with k-grams. The author focuses on isolated-term correction and context-sensitive correction, as well as two approaches to isolated-term correction: edit distance and k-gram overlap. Finally, the author introduces the Soundex algorithm, which has many different forms.
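    Because edit distance keeps coming up as a muddiest point, here is a short sketch of the standard dynamic-programming computation of Levenshtein distance, as in IIR 3:

def edit_distance(a, b):
    # Minimum number of insertions, deletions, and substitutions
    # turning string a into string b (classic DP table).
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j]: a[:i] vs b[:j]
    for i in range(m + 1):
        dp[i][0] = i                  # delete all of a[:i]
    for j in range(n + 1):
        dp[0][j] = j                  # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[m][n]

print(edit_distance("kitten", "sitting"))   # -> 3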

Week 1: Muddiest Points

1. I forgot what these two pictures stand for:
Unstructured (text) vs. Structured (database) Information in 1996

Unstructured (text) vs. Structured (database) Information in 2009

Week 1: Reading Notes

FOA 1.1:
FOA is a cognitive activity. It concerns meaning rather than the exact words. In the article, the author divides the process of FOA into three steps, and by using some pictures, the meaning of the three steps is introduced vividly.
Specifically, the three phases are: 1. asking a question; 2. constructing an answer; 3. assessing the answer. The phases mean that people first raise some questions, which are then formed into search queries and passed to the search engine. The search engine then uses a web crawler to gather answers from a huge document corpus. In the last step, people give relevance feedback to show which parts are useful, which are irrelevant, and which are neutral.

 IES 1.1-1.2:
The article gives a brief introduction to information retrieval. It includes an introduction to different kinds of search engines (web search, desktop and file system search, enterprise-level IR systems, digital libraries, and other specialized IR systems) and also shows us the components of an IR system.
In the following paragraphs, the author turns to ranking algorithms. He describes two important principles for measuring an IR system: efficiency and effectiveness. Only by searching in an effective and efficient way can we get what we want in the shortest time. The author then presents a principle named the Probability Ranking Principle (PRP), which says that if the search results are ranked in order of decreasing probability of relevance, effectiveness is maximized. By using some examples and listing some related terms, the author shows how to search in an efficient way. Moreover, the author uses a small paragraph to explain the concept of a "document" and how documents are updated; I think this concept will be more useful in later chapters.
As for me, the part of this chapter I find most interesting is web search. It is quite amazing how web search works. In the article, the author says that search engines store a "snapshot" of the web in order to produce accurate results and minimize response time. To keep these snapshots current, they use a web crawler to download updates periodically.

MIR 1.1-1.4:
In the first chapter of this book, the author mostly introduces some basic information about information retrieval. He first presents the concept of information retrieval, which is a way to help people get easy access to information of interest to them, and recounts the early development of IR: at first, IR technology was only used in libraries, but with the introduction of the World Wide Web, IR finally took a place at the center of the stage.
The author also lists some problems in IR. If people type too many words into a search, it may confuse the search engine; the engine then cannot generate related keywords, and people do not get the answers they want. Moreover, the author distinguishes information from data: information retrieval tolerates small differences, but data retrieval needs to be totally accurate.
In the chapter on the IR system, the author describes the inner construction of the IR system, namely its software architecture. The author uses two pictures to illustrate the different levels of an IR system and how the index we want is generated. Through the different layers of processing, we can get the top-ranking retrieval answers.
Finally, the author introduces the concept of the Web. He uses Jane Austen's example to illustrate the importance of the Web in letting people freely publish their ideas and works. At the last, the author lists five impacts of the Web on search engines. Those impacts are two sides of the same coin: they offer search engines the chance to develop, but also bring some negative results and disturb normal life. On the whole, however, they help search engines prosper.