Monday, March 31, 2014

Week 12: Reading Notes

User Profiles for Personalized Information Access:

    The amount of information available online is increasing exponentially. While this information is a valuable resource, its sheer volume limits its value. Many research projects and companies are exploring the use of personalized applications that manage this deluge by tailoring the information presented to individual users. These applications all need to gather, and exploit, some information about individuals in order to be effective. This area is broadly called user profiling. This chapter surveys some of the most popular techniques for collecting information about users, representing, and building user profiles. In particular, explicit information techniques are contrasted with implicitly collected user information using browser caches, proxy servers, browser agents, desktop agents, and search logs. We discuss in detail user profiles represented as weighted keywords, semantic networks, and weighted concepts. We review how each of these profiles is constructed and give examples of projects that employ each of these techniques. Finally, a brief discussion of the importance of privacy protection in profiling is presented.
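
    To make the weighted-keyword representation concrete for myself, here is a rough Python sketch of a profile built from implicitly observed pages. It only illustrates the general idea, not code from the chapter; the stopword list, the log-scaled term weights, and the decay factor that fades old interests are my own assumptions.

from collections import Counter
import math
import re

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}

def tokenize(text):
    """Lowercase, split into word tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

class KeywordProfile:
    """Weighted-keyword user profile built from implicitly observed documents."""

    def __init__(self, decay=0.9):
        self.weights = Counter()
        self.decay = decay  # older interests fade as new pages are observed

    def observe(self, page_text):
        """Update the profile with a page the user visited (implicit feedback)."""
        for term, w in self.weights.items():
            self.weights[term] = w * self.decay
        for term, count in Counter(tokenize(page_text)).items():
            self.weights[term] += 1.0 + math.log(count)

    def top_keywords(self, k=10):
        return self.weights.most_common(k)

profile = KeywordProfile()
profile.observe("Hybrid cars combine a gasoline engine with an electric motor.")
profile.observe("Electric vehicles depend on battery capacity and charging stations.")
print(profile.top_keywords(5))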

Content-Based Recommendation Systems:

    This chapter discusses content-based recommendation systems, i.e., systems that recommend an item to a user based upon a description of the item and a profile of the user’s interests. Content-based recommendation systems may be used in a variety of domains ranging from recommending web pages, news articles, restaurants, television programs, and items for sale. Although the details of various systems differ, content-based recommendation systems share in common a means for describing the items that may be recommended, a means for creating a profile of the user that describes the types of items the user likes, and a means of comparing items to the user profile to determine what to recommend. The profile is often created and updated automatically in response to feedback on the desirability of items that have been presented to the user.
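
    To see how the comparison step might work in practice, here is a minimal sketch using TF-IDF item descriptions, a user profile formed by averaging the vectors of liked items, and cosine similarity for ranking. The item descriptions and the averaging scheme are illustrative assumptions on my part, not the specific method of any system in the chapter; it assumes scikit-learn is available.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item descriptions (could be restaurants, articles, programs...).
items = {
    "item1": "spicy thai noodles with peanut sauce",
    "item2": "classic french bakery with fresh croissants",
    "item3": "thai curry restaurant with vegetarian options",
}
liked = ["item1"]  # items the user gave positive feedback on

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(items.values())
ids = list(items.keys())

# User profile = mean of the TF-IDF vectors of liked items.
profile = np.asarray(matrix[[ids.index(i) for i in liked]].mean(axis=0))

# Recommend unseen items ranked by similarity to the profile.
scores = cosine_similarity(profile, matrix).ravel()
ranked = sorted((i for i in ids if i not in liked),
                key=lambda i: scores[ids.index(i)], reverse=True)
print(ranked)  # item3 (also Thai food) should rank above item2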

Personalized web exploration with task models:

    Personalized Web search has emerged as one of the hottest topics for both the Web industry and academic researchers. However, the majority of studies on personalized search focused on a rather simple type of search, which leaves an important research topic - the personalization in exploratory searches - as an under-studied area. In this paper, we present a study of personalization in task-based information exploration using a system called TaskSieve. TaskSieve is a Web search system that utilizes a relevance feedback based profile, called a "task model", for personalization. Its innovations include flexible and user controlled integration of queries and task models, task-infused text snippet generation, and on-screen visualization of task models. Through an empirical study using human subjects conducting task-based exploration searches, we demonstrate that TaskSieve pushes significantly more relevant documents to the top of search result lists as compared to a traditional search system. TaskSieve helps users select significantly more accurate information for their tasks, allows the users to do so with higher productivity, and is viewed more favorably by subjects under several usability related characteristics.
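
    The "flexible and user controlled integration of queries and task models" can be pictured as a weighted blend of two relevance scores, one from the query and one from the task model. The sketch below is my own simplification (a linear interpolation controlled by a slider alpha), not TaskSieve's actual scoring formula.

def combined_score(query_score, task_model_score, alpha=0.5):
    """Blend the query relevance score with the task-model score.

    alpha = 1.0 -> rank purely by the query (profile ignored)
    alpha = 0.0 -> rank purely by the task model
    In a TaskSieve-like system the user would control this balance.
    """
    return alpha * query_score + (1 - alpha) * task_model_score

# Hypothetical retrieval scores for three documents.
docs = {
    "d1": {"query": 0.90, "task": 0.20},
    "d2": {"query": 0.60, "task": 0.80},
    "d3": {"query": 0.40, "task": 0.90},
}

for alpha in (1.0, 0.5, 0.0):
    ranking = sorted(docs, reverse=True,
                     key=lambda d: combined_score(docs[d]["query"], docs[d]["task"], alpha))
    print(alpha, ranking)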

Week 11: Muddiest Points

1. What is the difference between parallel corpora and comparable corpora?
2. How do we solve the out-of-vocabulary term problem?
3. How is parallelism achieved in IR?

Tuesday, March 25, 2014

Week 11: Reading Notes

Cross-Language Information Retrieval. Annual Review of Information Science and Technology

    This chapter reviews research and practice in cross-language information retrieval (CLIR) that seeks to support the process of finding documents written in one natural language (e.g., English or Portuguese) with automated systems that can accept queries expressed in other languages. With the globalization of the economy and the continued internationalization of the Internet, CLIR is becoming an increasingly important capability that facilitates the effective exchange of information. For retrospective retrieval, CLIR allows users to state questions in their native language and then retrieve documents in any supported language. This can simplify searching by multilingual users and, if translation resources are limited, can allow searchers to allocate those resources to the most promising documents. In selective dissemination applications, CLIR allows monolingual users to specify a profile using words from one language and then use that profile to identify promising documents in many languages. Adaptive filtering systems that seek to learn profiles automatically can use CLIR to process training documents that may not be in the same language as the documents that later must be selected.
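
    A common CLIR strategy is dictionary-based query translation: translate each query term into the document language, then run an ordinary monolingual search. The toy sketch below assumes that strategy; the tiny Portuguese-to-English dictionary and the choice to keep every candidate translation (and to pass unknown terms through unchanged) are illustrative assumptions, not recommendations from the chapter.

# Toy bilingual dictionary (Portuguese -> English); real systems use large
# lexicons or machine translation and must cope with out-of-vocabulary terms.
pt_to_en = {
    "economia": ["economy", "economics"],
    "global": ["global"],
    "comercio": ["trade", "commerce"],
}

def translate_query(query_terms, dictionary):
    """Replace each source-language term with all of its known translations.

    Terms missing from the dictionary are kept as-is, a common fallback
    that works reasonably well for names and cognates.
    """
    translated = []
    for term in query_terms:
        translated.extend(dictionary.get(term, [term]))
    return translated

print(translate_query(["economia", "global", "brasil"], pt_to_en))
# ['economy', 'economics', 'global', 'brasil']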

    CLIR is available in most major search engines, and it makes it much more convenient to retrieve documents that contain multiple languages. The most common approach is to translate the query into a single language when the query mixes languages. I am wondering how the priority language is chosen. For example, if a query contains both Japanese and English, should the system translate the Japanese into English or the English into Japanese?
    Moreover, in non-CLIR retrieval we use Boolean queries, index the terms, and look up the matching postings. Why can't we simply index each language in a separate index and, at query time, retrieve the documents that contain all or some of the query terms in either language? Finally, I am wondering about the market share of CLIR within the US IR market. I suspect most cross-language queries combine English with one other language, so perhaps we only need to focus on combining English with other languages.
    Since the beginning of this semester I have been wondering about the techniques for retrieving multimedia materials. After reading the materials, I have some basic ideas. Metadata plays a major role in multimedia search, because compared to the multimedia content itself, text is simple and manageable. At the same time, I believe image recognition is very helpful for matching query text against the materials in the collection. But because indexing multimedia is not as simple as indexing text, I am wondering how such an index is organized when it contains pictures or videos.

IES Chapter 14: Parallel Information Retrieval

    Information retrieval systems often have to deal with very large amounts of data. They must be able to process many gigabytes or even terabytes of text, and to build and maintain an index for millions of documents. To some extent the techniques discussed in Chapters 5–8 can help us satisfy these requirements, but it is clear that, at some point, sophisticated data structures and clever optimizations alone are not sufficient anymore. A single computer simply does not have the computational power or the storage capabilities required for indexing even a small fraction of the World Wide Web.
    In this chapter we examine various ways of making information retrieval systems scale to very large text collections such as the Web. The first part (Section 14.1) is concerned with parallel query processing, where the search engine’s service rate is increased by having multiple index servers process incoming queries in parallel. It also discusses redundancy and fault tolerance issues in distributed search engines. In the second part (Section 14.2), we shift our attention to the parallel execution of off-line tasks, such as index construction and statistical analysis of a corpus of text. We explain the basics of MapReduce, a framework designed for massively parallel computations carried out on large amounts of data.
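
    As a rough illustration of how MapReduce applies to index construction, the map step can emit (term, docid) pairs and the reduce step can collect each term's posting list. The single-process simulation below is my own toy version of that idea, not code from the book; a real cluster would shuffle the pairs across many machines.

from collections import defaultdict

documents = {
    1: "parallel information retrieval systems",
    2: "mapreduce builds an inverted index in parallel",
}

def map_phase(doc_id, text):
    """Map: emit a (term, doc_id) pair for every token in the document."""
    for term in text.split():
        yield term, doc_id

def reduce_phase(pairs):
    """Reduce: group the pairs by term into sorted posting lists."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(postings) for term, postings in index.items()}

# Everything runs in one process here; in a real deployment the map and
# reduce tasks would be distributed over many machines.
pairs = [pair for doc_id, text in documents.items()
         for pair in map_phase(doc_id, text)]
inverted_index = reduce_phase(pairs)
print(inverted_index["parallel"])  # [1, 2]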

Week 10: Muddiest Points

1. Why are web query logs so important?
2. What is the main function of crawler?
3. What is "The HITS algorithm: Grab pages"?

Monday, March 3, 2014

Week 10: Reading Notes

IIR Chapter 19:

    In this and the following two chapters, we consider web search engines. Sections 19.1–19.4 provide some background and history to help the reader appreciate the forces that conspire to make the Web chaotic, fast-changing and (from the standpoint of information retrieval) very different from the “traditional” collections studied thus far in this book. Sections 19.5–19.6 deal with estimating the number of documents indexed by web search engines, and the elimination of duplicate documents in web indexes, respectively. These two latter sections serve as background material for the following two chapters.
    We can view the static Web consisting of static HTML pages together with the hyperlinks between them as a directed graph in which each web page is a node and each hyperlink a directed edge. Early in the history of web search, it became clear that web search engines were an important means for connecting advertisers to prospective buyers.
    One aspect we have ignored in the discussion of index size in Section 19.5 is duplication: the Web contains multiple copies of the same content. By some estimates, as many as 40% of the pages on the Web are duplicates of other pages. Many of these are legitimate copies; for instance, certain information repositories are mirrored simply to provide redundancy and access reliability. Search engines try to avoid indexing multiple copies of the same content, to keep down storage and processing overheads.
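
    One standard way to detect near-duplicates is to compare documents by their sets of word k-grams ("shingles") using Jaccard similarity. The sketch below shows that comparison in its simplest form; the shingle length is an arbitrary choice of mine, and production engines use compact sketches such as MinHash rather than exact shingle sets.

def shingles(text, k=4):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical)."""
    return len(a & b) / len(a | b) if a | b else 1.0

doc_a = "certain information repositories are mirrored simply to provide redundancy and access reliability"
doc_b = "certain information repositories are mirrored simply to provide redundancy and access reliability for users"

sim = jaccard(shingles(doc_a), shingles(doc_b))
print(f"Jaccard similarity: {sim:.2f}")  # values near 1.0 suggest near-duplicates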

IIR Chapter 21:

    The analysis of hyperlinks and the graph structure of the Web has been instrumental in the development of web search. In this chapter we focus on the use of hyperlinks for ranking web search results. Such link analysis is one of many factors considered by web search engines in computing a composite score for a web page on any given query. We begin by reviewing some basics of the Web as a graph in Section 21.1, then proceed to the technical development of the elements of link analysis for ranking. Link analysis for web search has intellectual antecedents in the field of citation analysis, aspects of which overlap with an area known as bibliometrics. These disciplines seek to quantify the influence of scholarly articles by analyzing the pattern of citations amongst them. Much as citations represent the conferral of authority from a scholarly article to others, link analysis on the Web treats hyperlinks from a web page to another as a conferral of authority. Clearly, not every citation or hyperlink implies such authority conferral; for this reason, simply measuring the quality of a web page by the number of in-links (citations from other pages) is not robust enough. For instance, one may contrive to set up multiple web pages pointing to a target web page, with the intent of artificially boosting the latter’s tally of in-links. This phenomenon is referred to as link spam. Nevertheless, the phenomenon of citation is prevalent and dependable enough that it is feasible for web search engines to derive useful signals for ranking from more sophisticated link analysis. Link analysis also proves to be a useful indicator of what page(s) to crawl next while crawling the web; this is done by using link analysis to guide the priority assignment in the front queues of Chapter 20. Section 21.1 develops the basic ideas underlying the use of the web graph in link analysis. Sections 21.2 and 21.3 then develop two distinct methods for link analysis, PageRank and HITS.
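
    For a concrete reference point, PageRank can be computed by power iteration over the link graph. The following is a small self-contained sketch of the standard textbook formulation on a toy three-page graph; the damping factor of 0.85 and the fixed iteration count are conventional choices, not values taken from IIR.

def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each page to the list of pages it links to.  Pages with no
    out-links (dangling nodes) spread their rank uniformly over all pages.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outlinks in links.items():
            targets = outlinks if outlinks else pages  # dangling-node fix
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Toy web graph: A and B link to each other, C links only to A.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items(), key=lambda x: -x[1]):
    print(page, round(score, 3))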

Authoritative sources in a hyperlinked environment:

    The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of "authoritative" information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
    The further development of link-based methods to handle information needs other than broad-topic queries on the WWW poses many interesting challenges. As noted above, work has been done on the incorporation of textual content into our framework as a way of "focusing" a broad-topic search [6, 10, 11], but one can ask what other basic informational structures one can identify, beyond hubs and authorities, from the link topology of hypermedia such as the WWW. The means by which interaction with a link structure can facilitate the discovery of information is a general and far-reaching notion, and we feel that it will continue to offer a range of fascinating algorithmic possibilities.
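
    The mutually reinforcing relationship the paper describes is easy to state: a page's authority score is the sum of the hub scores of the pages that link to it, and its hub score is the sum of the authority scores of the pages it links to. Here is a short sketch of that iteration on a toy graph; the normalization step and the fixed number of iterations are my own choices rather than details from the paper.

import math

def hits(links, iterations=30):
    """Run the HITS iteration: good hubs point to good authorities,
    and good authorities are pointed to by good hubs."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum the hub scores of in-linking pages.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        # Hub update: sum the authority scores of linked-to pages.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Toy graph: p1 and p2 both point to p3.
graph = {"p1": ["p3"], "p2": ["p3"], "p3": []}
hub, auth = hits(graph)
print("top authority:", max(auth, key=auth.get))  # p3
print("top hub:", max(hub, key=hub.get))          # p1 or p2 (they tie)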

The Anatomy of a Large-Scale Hypertextual Web Search Engine:

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Week 8: Muddiest Points