Getting Real with Unreal Data: Lessons Learned and the Way Ahead

Thore Graepel, Royal Holloway, University of London

I give a general introduction to the problem of learning non-vectorial, “unreal” data. In particular, I give and discuss a number of examples of non-vectorial data that occur in practical applications ranging from music via game positions to XML data. Various methods of dealing with non-vectorial data have been developed in different communities. Taking up ideas presented at the 1998 NIPS workshop “Learning on Relational Data Representations” I will give an overview of methods ranging from purely symbolic approaches like ILP to embedding approaches like Plate's HRRs to kernel or similarity based approaches.

Following to some extent the arguments of Lev Goldfarb I argue that the geometry-based models many of us are used to have severe limitations and that in order to deal with “unreal” data we need to take into account their compositional structure.

Undirected graphical models for sequence analysis

Fernando Pereira, University of Pennsylvania

A few years ago we started investigating alternatives to HMMs for shallow parsing and information extraction in text processing. I will start by describing main problems and previous approaches in this area, and then discuss our proposal, a class of models we call “conditional random fields” (CRFs) that offer several advantages over generative models like hidden Markov models and stochastic grammars, including natural discriminative training and the ability to relax independence assumptions. I will discuss training methods for CRFs, and present experimental results on synthetic and natural-language data that show their advantages over HMMs and classifier-based sequence analysis methods. I will conclude with related work on undirected graphical models and generalized Perceptron algorithms.

Joint work with John Lafferty, Andrew McCallum, and Fei Sha.

Marginalized Kernels for Biological Sequences

Koji Tsuda Computational Biology Research Center, Tokyo

DNA, RNA and proteins are represented as discrete symbol sequences. When applying discriminative learning methods such as SVMs, they have to be transformed to vectors. Typically, it is implemented by counting the number of subsequences as in most string kernels. However this scheme is problematic, because two symbols which are far from each other in a sequence often have strong correlation. This remote correlation problem has been solved by adopting a probabilistic model such as the stochastic context free grammar (SCFG). Here we consider to extract vectors (or equivalently kernels) from such a probabilistic model in order to deal with remote correlations.

We propose a general way of designing a kernel from latent variable models (e.g. SCFG). First of all, a “joint kernel” is designed for complete data which include both visible and hidden variables. Then a “marginalized kernel” for visible data is obtained by taking expectation with respect to hidden variables. We actually derive several marginalized kernels useful for biological sequences. Especially, marginalized kernels derived from SCFG are shown to work well for RNA sequences, which typically have many remote correlations.

Algorithmic Challenges for Speech Mining

Mehryar Mohri, AT&T Research

Most classical learning and data analysis algorithms operate on vectorial input data. Real-world information sources such as speech are noisy and highly variable. Modern algorithms must deal with a large number of weighted and variable-length alternative sequences such as those output by a speech recognition system. This talk will describe some of the challenges faced in speech data mining and mention some partial answers.

Joint work with Corinna Cortes and Patrick Haffner.

Graphical Models for Non-vectorial Data

Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London

I will provide an overview of the advantages and limitations of using probabilistic graphical models to model data that are non-vectorial. Of course, classic examples of this include hidden Markov models for sequential data such as speech or amino acid sequences, and markov random fields for spatial data such as images. Graphical models can also be used to model tree-structured data (i.e. in the form of probabilistic context free grammars, or multi-scale Kalman filtering models), relational data, three dimensional structure of proteins, etc, although in some instances it is not clear whether the 'graphical’ nature of the probabilistic model is essential. Although graphical models provide a framework for modeling the data distribution it is not always straightforward to use this model for a discriminative classification or regression task. I will discuss several methods for deriving from the graphical model a vectorial representation of the data or a kernel, which in turn can be used in standard discriminative training procedures.

The Structure in Computer Vision Problems

Alan Yuille, Smith-Kettlewell Eye Research Institute

This talk described several prototypical vision problems and the way they are being addressed. Then I will discuss possible relationships to machine learning, classification, and regression.

Practical Non-parametric Density Estimation on a Transformation Group for Vision

Erik Miller, University of California at Berkeley

It is now common practice in machine vision to define the variability in an object's appearance in a factored manner, as a combination of shape and texture transformations. In this context, we present a simple and practical method for estimation non-parametric probability densities over a group of linear shape deformations. Samples drawn from such a distribution do not lie in a Euclidean Space, and standard kernel density estimates may perform poorly. While variable kernel estimators may mitigate this problem to some extent, the geometry of the underlying configuration space ultimately demands a kernel which accommodates its group structure. In this perspective, we propose a suitable equivariant estimator on the general linear group GL(n,R). We illustrate this approach by modeling image transformations in digit recognition problems, and present results showing the superiority of our estimator to comparable Euclidean estimators in this domain.

Joint work with Christophe Chef'dhotel.

Exponential and Geometric Kernels for Graphs

Thomas Gärtner, Fraunhofer Institut Autonome Intelligente Systeme

Labeled graphs are widely used in computer science to model data. Recently, research has begun investigating how positive definite kernel functions can be defined for graphs. Until now, however, most work has focused on one particular case only in which the instances of the learning problem are related to each other by a graph structure. Approaches differ with respect to the type of graph considered (e.g., trees) and with respect to the definition of the instances (e.g, an instance corresponds to a single vertex or to a set of vertices). A different challenge is to investigate kernels on instances that themselves are represented by graphs. Here, only very specific graphs have been considered so far. This paper investigates kernels on labeled directed graphs with general structure. We introduce a positive definite kernel function that can efficiently be computed and show that it is complete, i.e., that no information about the structure of the graph is lost.

Kernels on Automata

S.V.N. Vishwanathan, Indian Institute of Science

Automata are powerful abstractions very frequently used in computer science. They are intimately related to Hidden Markov Models as well as dynamic systems. In this talk we present kernels defined by various automata inspired by our work on string and tree kernels.

Comparing convolutional kernels and recursive networks on a wide-coverage computational analysis of natural language

Paolo Frasconi, Univ. Firenze

Prediction of first-pass attachment is a very important task in linguistics because of the widely accepted psycholinguistic hypothesis of incrementality in human language processing. The task is also interesting from a machine learning perspective since it involves ranking labeled trees forming a forest of alternatives. In previous work (Costa et. al 2000,2003) we have introduced a solution based on recursive neural networks (RNN). Conceptually, the method is based on two steps. First, a tree is mapped into a real vector of distributed features by recursive application of an adaptive transition function, processing the tree in a bottom up fashion. Second, a preference model is trained on the forest of alternative candidates by maximizing the multinomial probability that the correct tree (representing the correct first-pass attachment) is selected. The method achieved 74% first-pass attachment accuracy on a dataset of 500 sentences, significantly outperforming all previously known psycholinguistic heuristics.

More recently, Collins & Duffy (2001) have proposed an algorithm based on convolutional kernels and the voted Perceptron for learning to identify correct parse trees in a set of candidates. First-pass attachment and parse ranking can be seen as two instances of the same machine learning problem (i.e. learning preferences on labeled forests) and the two algorithms mentioned above can be applied to both tasks. They differ in one fundamental aspect. Convolutional kernels map a tree into a vector in a very high-dimensional feature space and attempt to regularize the separation hyperplane in feature space by seeking a large margin solution. Conversely, RNNs map a tree into a low-dimensional feature space but the mapping is adaptive and driven by the specific learning problem. To the best of our knowledge these two algorithms have not been compared on the same data.

Here we report several new ideas to enhance the architectural design, of RNNs, obtaining significant improvements of prediction accuracy, up to 80.5% (on the same dataset used in the above referenced work). In a wide-coverage experiment (the entire Wall Street Journal section of the Penn Treebank) our best prediction accuracy reaches 87%. We also report experiments comparing the learning curves of RNNs and convolutional kernels used in conjunction with the voting Perceptron algorithm. Our preliminary results indicate that the latter method does not appear to offer advantages over RNNs in terms of prediction accuracy. However, it is significantly outperformed in terms of computational cost. Surprisingly, this is true also for small datasets. For very small datasets we managed to train support vector machines, yet the prediction accuracy does not differ significantly compared to the voting Perceptron. These results suggest that 'general purpose’ kernels are less likely to be effective when dealing with hierarchical data structures and that designing proper kernel functions is a step that plays a very important role.