The DDI Approach to Features: Don't Do It (Slides)
25 years ago, Mercer, a member of the IBM speech group, tossed the bomb shell, “There is no data like more data!” A few years later, his boss, Jelinek, added, “Whenever I fire a linguist our system performance improves.” Jelinek later moved to Hopkins and hired Brill who concluded, “It never pays to think until you've run out of data,” and therefore we should “fire everyone and spend the money on data.” I will refer to this period as the DDI (don't do it) approach to features (and everything else).
A decade later, the web came along and life was good for the DDI approach. But the community is beginning to discover that although the web is large, it may not be that large. The web is also making it clear that Information Retrieval is not a simple game of Solitaire with a user up against the house (the library), but modern web search is a much more interesting multiplayer game with authors, users, advertisers and spammers all working together (as well as cross purposes). Different features are more appropriate for different perspectives. Features like TF, IDF and Page Rank shed light on authors’ perspectives, but users are more interested in readers’ perspectives.
Recently, we have been working on classification of audio documents with few (if any) resources (dictionaries/corpora). It is standard practice to connect black boxes (ASR speech recognition, IR, NLP) in series. Although this approach tends to multiply errors, GALE has demonstrated that it can be effective, especially for high resource languages (English, Chinese and Arabic). We have been exploring an alternative for languages with few resources (most languages) that applies classification directly to the speech without taking a dependency on ASR. This approach takes us full circle back to the DDI approach to features.
Online Feature Selection for Information Retrieval (Slides)
In this talk, we describe an approach for automatic query reduction on the Web, and discuss its connections to online feature selection for information retrieval.
Long web queries form an important segment in Web search. However, most web search engines perform poorly for long queries. We propose automatic query reduction as an approach for improving the effectiveness of long web queries. Specifically, we focus on a relative effectiveness estimation technique that enables automatic selection of reduced representations of the original long query.
This approach is related to feature selection in two ways. First, the effectiveness estimation technique that we develop can be viewed as an online feature selection technique. Web search engines use features that can be associated with each query term (or a set of query terms). These features are then combined in a machine learning framework to score documents. In this setting, selecting the best reduced version of a long query can be viewed as selecting the best subset of features during ranking. Second, the effectiveness estimation technique can also be used to directly select between ranking algorithms trained on multiple feature subsets. We will present some early results in this direction.
While traditional feature selection techniques that aim to reduce redundancy in features and select the best subset of features that work well for all future queries, the techniques we propose aim to identify features that work the best for each individual query. Given the typically large query-level variance in retrieval performance, we believe that such query-dependent selection of features can further help to improve the performance of learning algorithms.
Joint work with Giridhar Kumaran and Vitor Carvalho.
Hubness is a property of vector-space data expressed by the tendency of some points (hubs) to be included in unexpectedly many k-nearest neighbor (k-NN) lists of other points in a data set, according to commonly used similarity/distance measures. Alternatively, hubness can be viewed as increased skewness of the distribution of node in-degrees in the k-NN digraph obtained from a data set. Hubness has been observed in several communities (e.g., audio retrieval), where it was described as a problematic situation. The true causes of hubness have, for the most part, eluded researchers, which is somewhat surprising given that the phenomenon represents a fundamental characteristic of vector-space data.
Joint work with Milos Radovanovic and Mirjana Ivanovic.
Information Retrieval and Machine Learning at the Long Tail (Slides)
In this presentation we discuss how algorithms in IR and ML are adapted to build systems in the context of eBay Marketplace. The long tail nature of “few-of-a-kind” items create unique challenges for building catalogable products. The lack of catalogable products, in turn, create unique challenges building search applications, recommender systems and managing trust. However, the participatory nature of the users combined with availability of large amount of data also provides unique opportunities for incorporating user input into building simple scalable algorithms.
Ranking is a central problem in information retrieval. Much work has been done in the recent years to automate the development of ranking models by means of supervised machine learning. Feature selection aims to provide sparse models which are computationally efficient to evaluate, and have good ranking performance. We propose integrating the feature selection as part of the training process for the rank- ing algorithm, by means of a wrapper method which performs greedy forward selection, using leave-query-out cross-validation estimate of performance as the selection criterion. We introduce a linear time training algorithm we call greedy RankRLS, which combines the aforementioned procedure, together with regularized risk minimization based on pairwise least-squares loss. The training complexity of the method is O(kmn), where k is the number of features to be selected, m is the number of training examples, and n is the overall number of features. Experiments on the LETOR benchmark data set demonstrate that the approach works in practice.
Joint work with Antti Airola, Pekka Naula and Tapio Salakoski.
The ability of fast similarity search at large scale is of great importance to many Information Retrieval (IR) applications. A promising way to accelerate similarity search is semantic hashing which designs compact binary codes for a large number of documents so that semantically similar documents are mapped to similar codes (within a short Hamming distance). Since each bit in the binary code for a document can be regarded as a binary feature of it, semantic hashing is essentially a process of generating a few most informative binary features to represent the documents. Recently, we have proposed a novel Self-Taught Hashing (STH) approach to semantic hashing (that is going to be published in SIGIR-2010): we first find the optimal l-bit binary codes for all documents in the given corpus via unsupervised learning, and then train l classifiers via supervised learning to predict the l-bit code for any query document unseen before. In this paper, we present two further extensions to our STH technique: one is kernelisation (i.e., employing nonlinear kernels to achieve nonlinear hashing), and the other is supervision (i.e., exploiting the category label information to enhance the effectiveness of hashing). The advantages of these extensions have been shown through experiments on synthetic datasets and real-world datasets respectively.
Joint work with Jun Wang, Deng Cai and Jinsong Lu.
Hierarchical Bayesian Models of Language and Text (Slides)
Yee Whye Teh
In this talk I will present a new approach to modelling sequence data called the sequence memoizer. As opposed to most other sequence models, our model does not make any Markovian assumptions. Instead, we use a hierarchical Bayesian approach which enforces sharing of statistical strength across the different parts of the model. To make computations with the model efficient, and to better model the power-law statistics often observed in sequence data, we use a Bayesian nonparametric prior called the Pitman-Yor process as building blocks in the hierarchical model. We show state-of-the-art results on language modelling and text compression.
Joint work with Frank Wood, Jan Gasthaus, Cedric Archambeau and Lancelot James.
Feature selection is an important problem in machine learning since it helps reduce the number of features a learner has to examine and reduce errors from irrelevant features. Even though feature selection is well studied in the area of classification, this is not the case for ranking algorithms. In this paper, we propose a feature selection technique for ranking based on the wrapper approach used in classification. Our method uses the best first search strategy incrementally to partition the feature set into subsets. Features in each subset are then combined into a single feature using coordinate ascent in such a way that it maximizes any defined retrieval measure on a training set. Our experiments with many state-of-the-art ranking algorithms, namely RankNet, RankBoost, AdaRank and Coordinate Ascent, have shown that the proposed method can reduce the original set of features to a much more compact set while at least retaining the ranking effectiveness regardless of the ranking method in use.
Joint work with Bruce Croft.
Query-oriented summarization is primarily concerned with synthesizing an informative and well-organized summary from a document collection for a given query. In the existing summarization methods, each individual sentence in the document collection is represented as Bag of Words (BOW). In this paper, we propose a novel framework which improves query-oriented summarization via sentence wikification, i.e., enriching sentence representation with Wikipedia concepts. Furthermore, we exploit semantic relatedness of Wikipedia concepts as a smoothing factor in sentence wikification. The experiments with benchmark dataset show that sentence wikification performs effectively for improving query-oriented summarization, and helps to generate more high-quality summaries.
Joint work with Chunping Li.
Predicting user preferences is a core task in many online applications from ad targeting to content recommendation. Many prediction methods rely on the being able to represent the user by a profile of features. In this paper we propose a mechanism for generating such profiles by extracting features that summarize their past online behavior. The method relies on finding a compressed representation of the behavior by selecting the dominant features contributing to the Kullback-Leibler divergence between the default distribution over user actions and the user specific properties. We show that the feature selection model of Konopnicki et al. can be extended to a hierarchical encoding of user behavior by means of using an intermediate clustering representation. Preliminary experiments suggest the efficacy of our method.
Joint work with Xiaoiao Shi, Kevin Chang, Vijay K. Narayanan and Alexander J. Smola.
Relevance judgement method is an essential part of a feed search engine, which determines candidates for ranking query results. However, the different characteristics of feeds from traditional web pages may make existing relevance judgement approaches proposed for web page retrieval produce unsatisfactory result. Compared to web pages, feed is a structured document in that it contains several data elements, including title and description. In addition, feed is a temporal document since it dynamically publishes information on some specific topics over time. Accordingly, the relevance judgement method for feed retrieval needs to effectively address these unique characteristics of feeds. This paper considers a problem of identifying significant features which are a feature set created from feed data elements, with the aim of improving effectiveness of feed retrieval. We conducted extensive experiments to investigate the problem using support vector machine on real-world data set, and found the significant features that can be used for feed search services.
Joint work with Jonghun Park.
Representing and Exploiting Searcher Interaction Data
Understanding of the relationship between searcher intent and behavior is crucial for improving web search. Recently, it has been shown that the information contained in fine-grained searcher interactions, such as mouse movements and scrolling, can be useful for predicting user intent, for automatic search evaluation, and for clickthrough prediction. However, representing the searcher behavior and interaction data is a challenging task. This talk will describe our recent experiments on representing and exploiting searcher interaction data for predicting user intent and ad clickthrough behavior.