NIPS 1997 Workshop

Support Vector Machines

Organizers: Leon Bottou, Chris Burges, Bernhard Schoelkopf, Alex Smola

Location: Breckenridge, Colorado, December 6, 1997

Abstract

The Support Vector (SV) learning algorithm (Boser, Guyon, Vapnik, 1992; Cortes, Vapnik, 1995; Vapnik, 1995) provides a general method for solving Pattern Recognition, Regression Estimation and Operator Inversion problems. The method is based on results in the theory of learning with finite sample sizes. The last few years have witnessed an increasing interest in SV machines, due largely to excellent results in pattern recognition, regression estimation and time series prediction experiments. The purpose of this workshop is (1) to provide an overview of recent developments in SV machines, ranging from theoretical results to applications, (2) to explore connections with other methods, and (3) to identify weaknesses, strengths and directions for future research for SVMs. We invite contributions on SV machines and related approaches, looking for empirical support wherever possible.

Topics

  • SV Applications
  • Benchmarks
  • SV Optimization and implementation issues
  • Theory of generalization and regularization
  • Learning methods based on Hilbert-Schmidt kernels (e.g. kernel PCA)
  • Links to related methods and concepts (e.g. boosting, fat shattering)
  • Representation of functions in SV machines (e.g. splines, anova)

Schedule

  • 7:30 V. Vapnik, AT&T Labs – Research “The SVM method of function approximation”
  • 8:00 G. Wahba, University of Wisconsin-Madison “Reproducing Kernel Hilbert Spaces, Smoothing Spline ANOVA spaces, Support Vector Machines, and all that”
  • 8:30 L. Kaufman, Bell Laboratories, Lucent Technologies “Solving the Quadratic Programming problem arising in support vector classification”
  • 8:45 C. Burges, Bell Laboratories, Lucent Technologies “Geometry of Support Vector Machines”
  • 9:00 B. Scholkopf, Max-Planck-Inst. Tubingen and GMD Berlin “Kernel Principal Component Analysis”
  • 9:15 D. Mattera and S. Haykin, McMaster University “Support Vector Machines for Dynamic Reconstruction of Chaotic Processes”
  • 9:40 K. Muller, GMD First, Berlin “Predicting Time Series with Support Vector Machines”
  • 10:00 A. Oren and D. Roth, University of Illinois, Urbana/Champaign “Support Vectors in Natural Language Disambiguation Tasks”
  • 10:15 T. Joachims, Universitat Dortmund “Support Vector Machines for Text Categorization”
  • 10:30 skiing
  • 16:00 F. Girosi, MIT, Cambridge “Support Vector Machines, Regularization Theory and Sparse Approximation”
  • 16:25 K. Bennett, Rensselaer Polytechnic Institute “Combining Support Vector and Mathematical Programming Models for Induction”
  • 16:50 M. Stitson, Royal Holloway London “Support Vector ANOVA Decomposition”
  • 17:05 J. Weston, Royal Holloway London “Density Estimation using Support Vector Machines”
  • 17:20 A. Smola, GMD Berlin “General Cost Functions for Support Vector Regression”
  • 17:35 Y. Freund, AT&T Labs – Research “Adaboost as a procedure for margin maximization”
  • 18:00 J. Shawe-Taylor, Royal Holloway London “Data-sensitive PAC Analysis for SV and Other Machines”
  • 18:15 N. Cristianini, University of Bristol “Bayesian Voting Schemes and Large Margin Algorithms”
  • 18:30 D. Schuurmans, U. Pennsylvania & NEC Research “Improving and generalizing the basic maximum margin algorithm”
  • 18:45 open discussion
  • 19:00 end

Abstracts

Combining Support Vector and Mathematical Programming Models for Induction

K. Bennett, Mathematical Sciences Department, Rensselaer Polytechnic Institute

Research in mathematical programming (MP) has much to contribute to support vector machines (SVM) beyond powerful optimization algorithms. Existing MP approaches to modeling induction problems are closely related to SVM. In this talk we examine MP models for classification, regression, and clustering problems. We will show how SVM concepts are readily incorporated into these models. The result is improved performance of the MP models and suggestions for new research directions for SVM.

Geometry of Support Vector Machines

C. Burges, Lucent, Bell Laboratories

The kernel mapping inherent in support vector machines induces a metric on the space in which the data lives (“input space”). I show that this metric can be expressed in terms of the kernel; that every positive Riemannian metric on input space has an associated kernel mapping; and that positivity of the metric leads to some very simple conditions that the kernel must satisfy, which are much easier to check than Mercer’s condition. I will also describe work to incorporate known invariances of the problems directly into the kernels themselves.

Bayesian Voting Schemes and Large Margin Algorithms

Nello Cristianini, University of Bristol

It is often claimed that one of the main distinctive features of Bayesian learning algorithms is that they do not simply output one hypothesis, but rather an entire probability distribution over a hypothesis set: the Bayes posterior. An alternative perspective is that they output a linear combination of classifiers, whose coefficients are given by Bayes' theorem. Such a linear combination can be described as a hyperplane in a high dimensional space: a kind of hypothesis commonly used in Statistical Learning Theory. One of the concepts used to deal with thresholded convex combinations is the "margin" of the hyperplane with respect to the training sample, which is correlated with the predictive power of the hypothesis itself. In this work, we study the average margin of Bayesian classifiers in the case of pattern classification with a uniform prior. We prove that under certain conditions it can be large. The average margin is conjectured to be a good indicator of the generalization power of the hypothesis, as suggested by experimental results and by theoretical results on analogous quantities.

Adaboost as a Procedure for Margin Maximization

Y. Freund, AT&T Labs – Research

AdaBoost (Freund and Schapire, 1995) is an algorithm for improving the performance of a learning algorithm. It does so by running the base learning algorithm several times and combining the resulting hypotheses by a weighted majority rule. It has recently been demonstrated that the majority rule has unexpectedly low generalization error (Cortes and Drucker, 1996; Breiman, 1996; Quinlan, 1996). It has also been shown, both theoretically and experimentally, that this behavior is connected to the fact that the weighted majority vote rule has large margins (Schapire, Freund, Bartlett, Lee, 1997). These results establish an interesting connection between Support Vector Machines and AdaBoost. An interesting difference between the methods is that in SVMs one uses the \(l_2\) norm to measure both the length of the vector that defines the linear separator and the length of the "largest" example. In AdaBoost, on the other hand, the separation vector is measured using the \(l_1\) norm and the examples are measured using the \(l_\infty\) norm (max norm). This technical difference is the reason that AdaBoost is closely related to linear programming while SVMs are related to quadratic programming. It also results in incomparable theoretical performance bounds.
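
The norm pairing described above can be made concrete with a minimal numpy sketch (illustrative, not from the talk): the same linear separator and example yield different margins depending on whether one normalizes by \(l_2\) norms (SVM-style) or by \(\|w\|_1\) and \(\|x\|_\infty\) (AdaBoost-style).

```python
import numpy as np

# Toy comparison of the two margin normalizations mentioned in the abstract.
# SVM-style margins normalize by the l2 norms of the weight vector w and the
# example x; AdaBoost-style margins use ||w||_1 and ||x||_inf instead.

def svm_margin(w, x, y):
    """Margin with l2 / l2 normalization, as in support vector machines."""
    return y * np.dot(w, x) / (np.linalg.norm(w, 2) * np.linalg.norm(x, 2))

def adaboost_margin(w, x, y):
    """Margin with l1 / l_inf normalization, as in AdaBoost."""
    return y * np.dot(w, x) / (np.linalg.norm(w, 1) * np.linalg.norm(x, np.inf))

# A sparse separator on a dense example: the l1/l_inf margin is larger,
# matching the intuition that the AdaBoost bounds favour sparse weight vectors.
w = np.array([1.0, 0.0, 0.0, 0.0])
x = np.array([1.0, 1.0, 1.0, 1.0])
y = 1.0
print(svm_margin(w, x, y))       # 0.5
print(adaboost_margin(w, x, y))  # 1.0
```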

Support Vector Machines, Regularization Theory and Sparse Approximation

F. Girosi, Center for Biological and Computational Learning, MIT

In this talk I will discuss the equivalence between two different approximation techniques: the Support Vector Machine (SVM), proposed by V. Vapnik (1995), and a modified version of the Basis Pursuit De-Noising algorithm (BP) (Chen, 1995; Chen, Donoho and Saunders, 1995). BP is an instance of a sparse approximation technique, in which a function is reconstructed by using the least possible number of basis functions chosen from a large set (dictionary). We show that, under certain conditions, these two techniques are equivalent in the following sense: if applied to the same data set they give the same solution, which is obtained by solving the same quadratic programming problem. We also present a derivation of the SVM technique in the framework of regularization theory, rather than statistical learning theory, establishing a connection between SVM, sparse approximation and regularization theory.

Support Vector Machines for Text Categorization

Thorsten Joachims, Universitat Dortmund

This presentation explores the use of Support Vector Machines (SVMs) for learning text classifiers from examples. It identifies the particular properties of learning with text data: (a) very high dimensional (> 10000) input spaces, (b) most of the features are relevant (i.e. dense target concepts), and (c) sparse instance vectors. Theoretical considerations indicate that SVMs are well suited for learning in this setting. Empirical results support the theoretical findings. SVMs achieve substantial improvements over the currently best performing methods, eliminating the need for feature selection. Furthermore, their robustness makes them very promising for practical applications.

Solving the Quadratic Programming Problem Arising in Support Vector Classification

L. Kaufman, Lucent, Bell Laboratories

The support vector technique suggested by Cortes and Vapnik is designed to solve the classification problem of determining the class of a data point, given some characteristics of the point (a vector) and many "training" points whose classes and characteristics are known. As shown by Cortes and Vapnik, the support vector technique leads to an indefinite quadratic programming problem with a dense and structured matrix, bound constraints, and one general equality constraint. The number of rows in the matrix equals the number of training data points. Thus for a problem with thousands of training points, just computing the matrix for the quadratic programming problem is expensive, and one may not be able to store it. In this paper we discuss several methods for solving the underlying quadratic programming problem, including a projected Newton approach and a conjugate gradient approach.
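
The storage issue described above can be seen by assembling the matrix of the standard dual QP, whose entries are \(Q_{ij} = y_i y_j K(x_i, x_j)\). A minimal sketch (the RBF kernel and its width are illustrative choices, not from the talk):

```python
import numpy as np

# Sketch of the dense matrix in the SV classification dual: maximize
#   sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j K(x_i, x_j)
# subject to 0 <= a_i <= C and sum_i a_i y_i = 0.
# The matrix Q with Q_ij = y_i y_j K(x_i, x_j) is dense, so for m training
# points it needs O(m^2) memory, which is the bottleneck the talk addresses.

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

def dual_matrix(X, y, gamma=1.0):
    """Q_ij = y_i y_j K(x_i, x_j) for the SV classification dual."""
    K = rbf_kernel(X, gamma)
    return (y[:, None] * y[None, :]) * K

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
Q = dual_matrix(X, y)
print(Q.shape)  # (5, 5) -- grows quadratically with the number of training points
```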

Support Vector Machines for Dynamic Reconstruction of Chaotic Processes

D. Mattera and S. Haykin, Neurocomputation for Signal Processing Group, McMaster University

Dynamic reconstruction is an inverse problem that deals with reconstructing the dynamics of an unknown system, given a noisy time series representing the evolution of one variable of the system with time. The solution to the problem rests on Takens' embedding theorem. The reconstruction proceeds by using the time series to build a predictive model of the system and then using iterated prediction to test what the model has learned from the training data about the dynamics of the system. In this paper we present the results of an investigation into the use of the support vector machine for solving the dynamic reconstruction problem. Two series of experiments are described, one using data from a noisy Lorenz attractor, and the other using real-life data on the radar backscatter from an ocean surface.
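
The first step of such a reconstruction is the delay embedding of the scalar series; a minimal sketch (the embedding dimension and lag below are illustrative choices, not values from the talk):

```python
import numpy as np

# Turn a scalar time series s into delay vectors
# [s_t, s_{t+tau}, ..., s_{t+(d-1)tau}] with next-step targets, per Takens'
# embedding theorem. A regression model (an SVM in the talk) is then trained
# to predict the target from the delay vector, and iterated prediction feeds
# the model's outputs back in as inputs.

def delay_embed(s, d=3, tau=1):
    """Return delay vectors (rows of X) and the next-step targets y."""
    s = np.asarray(s, dtype=float)
    n = len(s) - (d - 1) * tau - 1
    X = np.stack([s[i * tau : i * tau + n] for i in range(d)], axis=1)
    y = s[(d - 1) * tau + 1 : (d - 1) * tau + 1 + n]
    return X, y

s = np.arange(10.0)
X, y = delay_embed(s)
print(X[0], y[0])  # [0. 1. 2.] 3.0
```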

Predicting Time Series with Support Vector Machines

K.-R. Muller, GMD First, Berlin

Support Vector Machines are used for time series prediction and compared to radial basis function networks. We make use of two different cost functions for Support Vectors: training with (i) an epsilon-insensitive loss and (ii) Huber's robust loss function, and discuss how to choose the regularization parameters in these models. Two applications are considered: data from (a) a noisy (normal and uniform noise) Mackey-Glass equation and (b) the Santa Fe competition (set D). In both cases Support Vector Machines show excellent performance. In case (b) the Support Vector approach improves on the best known result for the benchmark by 37%.
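
The two cost functions compared above can be sketched directly (parameter values are illustrative): the epsilon-insensitive loss is zero inside a tube of width epsilon and linear outside it, while Huber's loss is quadratic for small residuals and linear for large ones.

```python
import numpy as np

# Sketch of the two Support Vector cost functions named in the abstract,
# applied to residuals r = y_true - y_pred.

def eps_insensitive(r, eps=0.1):
    """Zero inside the epsilon-tube, linear outside: max(|r| - eps, 0)."""
    return np.maximum(np.abs(r) - eps, 0.0)

def huber(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond (robust to outliers)."""
    a = np.abs(r)
    return np.where(a <= delta, 0.5 * r**2, delta * (a - 0.5 * delta))

r = np.array([-2.0, -0.05, 0.0, 0.5, 3.0])
print(eps_insensitive(r))  # [1.9 0.  0.  0.4 2.9]
print(huber(r))            # [1.5     0.00125 0.      0.125   2.5    ]
```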

Support Vectors in Natural Language Disambiguation Tasks

Amichay Oren and Dan Roth, University of Illinois – Urbana/Champaign

Learning problems in the natural language domain often map the text to a space whose dimensions are the measured features of the text, e.g., its words. Two characteristic properties of this domain are that its dimensionality is very high and that both the learned concepts and the instances reside very sparsely in the feature space. We study the problem of context-sensitive spelling correction. We adapt support vector algorithms to work in this domain and show that they are able to exploit the full set of features and outperform methods that first prune the feature space.

Kernel Principal Component Analysis

Bernhard Scholkopf, MPI fur biologische Kybernetik, Tubingen, and GMD Berlin

A new method for performing a nonlinear form of Principal Component Analysis is proposed. Using SV kernel functions, one can efficiently compute principal components in high-dimensional feature spaces, related to input space by some nonlinear map; for instance, the space of all possible d-pixel products in images. The algorithm consists of solving an eigenvalue problem for the kernel dot product matrix.
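
A minimal sketch of the idea (illustrative code, not the authors' implementation): form the Gram matrix, center it in feature space, and solve the eigenvalue problem; the projections of the training points onto the nonlinear principal components are the eigenvectors scaled by the square roots of their eigenvalues.

```python
import numpy as np

# Kernel PCA sketch: the kernel choice and data below are illustrative.

def kernel_pca(X, kernel, n_components=2):
    m = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])
    # Center in feature space: K' = K - 1K - K1 + 1K1, where 1 = ones/m.
    one = np.ones((m, m)) / m
    Kc = K - one @ K - K @ one + one @ K @ one
    vals, vecs = np.linalg.eigh(Kc)          # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]   # sort descending
    # With eigenvectors normalized so the feature-space directions have unit
    # length, the projection of training point j onto component i is
    # sqrt(lambda_i) * v_i[j].
    return vecs[:, :n_components] * np.sqrt(np.maximum(vals[:n_components], 0))

rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
Z = kernel_pca(X, rbf, n_components=2)
print(Z.shape)  # (20, 2)
```

Because the Gram matrix is centered, each column of component scores sums to (numerically) zero, as in ordinary PCA.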

Improving and Generalizing the Basic Maximum Margin Algorithm

Dale Schuurmans, U. Pennsylvania & NEC Research

I consider a simple extension of Vapnik’s maximum margin algorithm that exploits additional unlabeled training examples. The idea is to measure the distance between hyperplanes, not by the \(L_2\) distance of their weight vectors, but by their disagreement distance on the natural distribution of unlabeled examples. I show that this simple trick can lead to significant improvements in generalization accuracy in several simulated domains.
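
The disagreement distance described above has a direct empirical estimate; a minimal sketch (toy data and separators, not from the talk):

```python
import numpy as np

# Estimate the disagreement distance between two hyperplanes: instead of
# ||w1 - w2||_2, measure the fraction of unlabeled examples that the two
# separators classify differently.

def disagreement_distance(w1, w2, X):
    """Fraction of unlabeled examples on which the hyperplanes disagree."""
    return np.mean(np.sign(X @ w1) != np.sign(X @ w2))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))       # unlabeled sample from the data distribution
w1 = np.array([1.0, 0.0])
w2 = np.array([0.0, 1.0])            # orthogonal separators disagree on ~half the points
print(disagreement_distance(w1, w2, X))
```

For rotationally symmetric data, two hyperplanes at angle theta disagree on a fraction theta / pi of the examples, so the orthogonal pair above disagrees on about half of them.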

Data-sensitive PAC Analysis for SV and Other Machines

John Shawe-Taylor, Royal Holloway London

Contrary to popular belief, to obtain rigorous PAC bounds for SV machines, it is not sufficient to estimate the VC dimension of large margin hyperplanes, since standard structural risk minimisation cannot be applied in this case. Resolving this technical difficulty requires a bound on covering numbers in terms of the fat-shattering dimension. The analysis throws new light on why the SV machine gives good generalization and can also be applied to help explain the performance of the AdaBoost algorithm.

General Cost Functions for Support Vector Regression

Alex Smola, GMD Berlin

The concept of Support Vector Regression is extended to a more general class of convex cost functions. Moreover, it is shown how the resulting convex constrained optimization problems can be efficiently solved by a primal-dual interior point path-following method. Both computational feasibility and improved estimation are demonstrated in experiments.

Support Vector ANOVA Decomposition

Mark O. Stitson, Royal Holloway London

This talk will introduce Support Vector ANOVA decomposition (SVAD) applied to a well known data set. SVAD is used with various kernels, and preliminary results have shown that SVAD performs better than the corresponding non-ANOVA-decomposition kernels. The Boston housing data set from UCI has previously been used to test bagging and Support Vector methods, and those results are compared to the SVAD method.

The SVM Method of Function Approximation

V. Vapnik, AT&T Research

This talk will address the representation of high dimensional functions using Support Vector Machines. Besides being a learning algorithm, Support Vector Machines provide a linear and compact representation of high dimensional nonlinear functions. These properties (compactness and linearity) provide several opportunities which I would like to discuss.

Reproducing Kernel Hilbert Spaces, Smoothing Spline ANOVA Spaces, Support Vector Machines, and All That

Grace Wahba, Statistics, University of Wisconsin-Madison

In this (mostly) review talk, we look at reproducing kernel Hilbert spaces (RKHS) as a natural home for studying certain support vector machines and their relations with other multivariate function estimation paradigms. We examine Lemma 6.1 of Kimeldorf and Wahba (J. Math. Anal. Applic. 1971, p 92) concerning linear inequality constraints in RKHS, and note how it applies in arbitrary RKHS. We will particularly note the potential use of the reproducing kernels associated with smoothing spline ANOVA spaces, for applications with heterogeneous predictor variables.

Density Estimation Using Support Vector Machines

Jason Weston, Royal Holloway London

I show how the SV technique of solving linear operator equations can be applied to the problem of density estimation. I describe a new optimisation procedure and set of kernels closely related to current SV techniques that guarantee the monotonicity of the approximation. This technique estimates densities with a mixture of bumps (Gaussian-like shapes), with the usual SV property that only some coefficients are non-zero. Both the width and the height of each bump are chosen adaptively, by considering a dictionary of several kernel functions.

Summaries

V. Vapnik (AT&T Research)

The SVM method for function approximation (Theory):

  1. Better SVM performance is obtained by finding kernels that minimize D/r, where D is the diameter of the minimal sphere containing the data in the high-dimensional feature space, and r is the margin
  2. “Transductive inference” (train on all possible labelings of the test set, choose min. D/r) should also help performance
  3. Can get conditional probabilities out of SVMs by solving 1-dim integral equation.

G. Wahba (U. Wisconsin at Madison)

Reproducing kernel spaces, smoothing spline ANOVA spaces, SV machines, and all that (Theory):

  1. There is a 1-1 correspondence between reproducing kernel Hilbert spaces (RKHS) and positive definite functions, and between positive definite functions and zero-mean Gaussian stochastic processes.
  2. Replacing the logit cost functional in penalized log-likelihood methods gives several SVMs; hence there is a relation between older techniques and SVMs (a recurring theme).
  3. For classification, replacing the logit likelihood functional by the SVM functional is a natural way to concentrate on estimating the logit near the classification boundary.

L. Kaufman (Bell Labs, Lucent)

Solving the QP problem arising in SV classification (Applications):

  1. For SVM training, there are methods for computing only the part of the Hessian that you need. The idea is to solve a sequence of equality-constrained problems.

C. Burges (Bell Labs, Lucent)

The Geometry of support vector machines (Theory):

  1. The SVM mapping induces a natural Riemannian metric on the data
  2. The kernel contains more information than the corresponding metric
  3. This observation gives useful tests for positivity of any proposed SVM kernel

B. Scholkopf (Max Planck at Tubingen, GMD)

Kernel principal component analysis (Theory + Applications):

Can use the kernel mapping trick for any algorithm that depends only on dot products of the data: can apply this observation to construct a nonlinear version of PCA.

Y. Freund (AT&T Research)

AdaBoost as a procedure for margin maximization (Theory + Applications):

  1. The naturally occurring norms in AdaBoost are \(|x|_\infty\) and \(|w|_1\) (x is the data, w the separating weight vector); for SVMs they are \(|x|_2\) and \(|w|_2\).
  2. Corresponding bounds imply that AdaBoost will beat SVMs when most features are irrelevant, and vice versa.

K. Mueller (GMD FIRST, Berlin)

Predicting Time Series with SVMs (Applications):

  1. SVMs give better regression results than other systems on noisy data. In particular, they give equal or better performance than RBFs (despite fixed widths) on Mackey-Glass time series, and much better performance (29%) than all other published results on the Santa Fe time series competition, data set “D”.
  2. SVM classification is currently the best in the world for charmed quark detection.

T. Joachims (U. Dortmund)

SV machines for text categorization (Applications):

SVMs give much better results, and are much more robust, than other methods tried, for a text categorization problem.

F. Girosi (CBCL, MIT)

SVMs, regularization theory, and sparse approximation (Theory):

A modified version of the Basis Pursuit De-Noising algorithm is equivalent to an SVM. RKHS play a fundamental role in SVMs.

K. Bennett (Rensselaer Polytechnic Institute)

Combining SV methods and mathematical programming methods for induction:

  1. Mathematical programming models (MPM) have a rich history: MPM and SVM use the same canonical form, and are easily combined
  2. MP could itself benefit from learning theory

M. Stitson (Royal Holloway London)

SV ANOVA decomposition (Theory + Applications):

ANOVA decomposition splines kernels give better results than regular splines and polynomial kernels on the Boston Housing regression problem.

J. Weston (Royal Holloway London)

Density estimation using SV machines (Theory):

  1. The SVM algorithm can be modified to estimate a density (differences: this results in an LP problem, and the kernels are not Mercer kernels)
  2. Showed a natural way to use kernels of different widths.

A. Smola (GMD FIRST, Berlin)

General cost functions for SV regression (Theory):

  1. By using path-following interior point methods for SVM training, the class of cost functions can be extended
  2. This is useful because, where possible, the cost function should be matched to the noise for best performance.

J. Shawe-Taylor (Royal Holloway London)

Data-sensitive PAC analysis for SV and other machines (Theory):

PAC bounds can be used to put SRM on a rigorous, data-dependent footing for SV machines.

N. Cristianini (U. Bristol)

Bayesian voting schemes and large margin algorithms (Theory):

Bayesian learning algorithms can be regarded as large margin hyperplane algorithms in high dimensional feature space, where the margin depends on the prior.

D. Schuurmans (U. Pennsylvania & NEC Research)

Improving and generalizing the basic maximum margin algorithm (Theory):

  1. \(L_1\) (as opposed to \(L_2\)) margins will be a win if target vector is sparse, e.g. text classification.
  2. Can use unlabeled data, which gives metric on hypothesis space, to improve generalization performance.