Summaries
V. Vapnik (AT&T Research):
The SVM method for function approximation (Theory):
Will get better SVM performance by finding kernels that minimize D/r,
where D is the diameter of the minimal sphere containing the data in the
high-dimensional feature space, and r is the margin (a rough sketch of
estimating this ratio follows this entry)
“Transductive inference” (train on all possible labelings of the test set,
choose min. D/r) should also help performance
Can get conditional probabilities out of SVMs by solving a one-dimensional
integral equation.
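
A minimal sketch of how this criterion might be computed for a trained SVM
(my own illustration, not from the talk; scikit-learn and an RBF kernel are
assumed, and the maximum pairwise feature-space distance is used as a simple
proxy for the diameter of the minimal enclosing sphere):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    def d_over_r(X, y, gamma):
        """Proxy for the D/r criterion with an RBF kernel of width gamma."""
        sq = np.sum(X ** 2, axis=1)
        K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
        # Feature-space distances: ||phi(x)-phi(x')||^2 = K(x,x)+K(x',x')-2K(x,x')
        dist2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K
        D = np.sqrt(max(dist2.max(), 0.0))      # proxy for enclosing-sphere diameter
        svm = SVC(kernel="rbf", gamma=gamma, C=1e3).fit(X, y)
        a = svm.dual_coef_[0]                   # y_i * alpha_i for the support vectors
        K_sv = K[np.ix_(svm.support_, svm.support_)]
        w_norm = np.sqrt(a @ K_sv @ a)          # ||w||, so the margin is r = 1 / ||w||
        return D * w_norm                       # = D / r

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    for gamma in (0.01, 0.1, 1.0):              # smaller D/r suggests a better kernel
        print(gamma, d_over_r(X, y, gamma))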
G. Wahba (U. Wisconsin at Madison):
Reproducing kernel spaces, smoothing spline ANOVA spaces, SV machines, and all
that (Theory):
There is a 1-1 correspondence between reproducing kernel Hilbert spaces
(RKHS) and positive definite functions, and between positive definite
functions and zero-mean Gaussian stochastic processes.
Replacing the logit cost functional in penalized log-likelihood methods by
other cost functionals gives several SVM variants - hence there is a relation
between older techniques and SVMs (this was a recurring theme).
For classification, replacing the logit likelihood functional by the SVM
functional is a natural way to concentrate on estimating the logit near the
classification boundary.
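
For reference, a schematic version of the two variational problems (standard
forms written in my notation, with labels y_i in {-1, +1} and f ranging over
the RKHS H_K): the penalized log-likelihood (logit) fit and the SVM fit
differ only in the data-fit term,

    \[
    \min_{f \in \mathcal{H}_K}\ \frac{1}{n}\sum_{i=1}^{n}\log\!\bigl(1 + e^{-y_i f(x_i)}\bigr)
      + \lambda\|f\|_{\mathcal{H}_K}^2
    \qquad\text{vs.}\qquad
    \min_{f \in \mathcal{H}_K}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(1 - y_i f(x_i)\bigr)_{+}
      + \lambda\|f\|_{\mathcal{H}_K}^2 .
    \]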
L. Kaufman (Bell Labs, Lucent):
Solving the QP problem arising in SV classification (Applications):
For SVM training, there are methods for computing only the part of the
Hessian that is actually needed. The idea is to solve a sequence of
equality-constrained problems. The variables can be projected into a subspace
in two interesting ways: (1) such that the inverse of the resulting Hessian
can be computed very efficiently, which requires more storage; (2) such that
the Hessian itself is not needed and the gradient can be computed very
efficiently, which needs less storage but more compute power.
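
A minimal sketch of the second option under my own assumptions (an RBF kernel
and the standard SVM dual): the gradient is accumulated one kernel row at a
time, so the full n-by-n Hessian is never stored:

    import numpy as np

    def dual_gradient(X, y, alpha, gamma):
        """Gradient of the SVM dual W(a) = sum(a) - 0.5 a'Qa, Q_ij = y_i y_j K(x_i, x_j),
        computed row by row so that Q is never formed explicitly."""
        n = X.shape[0]
        grad = np.ones(n)
        for i in range(n):
            k_row = np.exp(-gamma * np.sum((X - X[i]) ** 2, axis=1))   # K(x_i, .)
            grad[i] -= y[i] * np.dot(alpha * y, k_row)                 # (Q alpha)_i
        return grad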
C. Burges (Bell Labs, Lucent):
The Geometry of support vector machines (Theory):
The SVM mapping induces a natural Riemannian metric on the data
The kernel contains more information than the corresponding metric
This observation gives useful tests for positivity of any proposed SVM
kernel
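
A brief note on what this metric looks like (my reconstruction of the
standard pullback formula, not a statement from the talk): since
Phi(x) . Phi(y) = K(x, y), the induced metric is

    \[
    g_{ij}(x) \;=\; \left.\frac{\partial^2 K(x, y)}{\partial x_i\,\partial y_j}\right|_{y = x},
    \]

so for the Gaussian kernel K(x, y) = exp(-\|x - y\|^2 / 2\sigma^2) one gets
g_{ij}(x) = \delta_{ij} / \sigma^2, a flat metric.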
B. Scholkopf (Max Planck at Tuebingen, GMD):
Kernel principal component analysis (Theory + Applications):
Can use the kernel mapping trick for any algorithm that depends only on dot
products of the data; applying this observation to PCA gives a nonlinear
version of PCA
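
A minimal kernel PCA sketch (my own illustration, assuming an RBF kernel;
only the kernel matrix of the data is ever used):

    import numpy as np

    def kernel_pca(X, gamma, n_components):
        """Eigendecompose the centered RBF kernel matrix and return the projections
        of the training points onto the leading nonlinear principal components."""
        n = X.shape[0]
        sq = np.sum(X ** 2, axis=1)
        K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
        one = np.ones((n, n)) / n                    # center the data in feature space
        Kc = K - one @ K - K @ one + one @ K @ one
        vals, vecs = np.linalg.eigh(Kc)              # ascending order
        vals, vecs = vals[::-1], vecs[:, ::-1]
        alphas = vecs[:, :n_components] / np.sqrt(vals[:n_components])  # unit-norm components
        return Kc @ alphas                           # nonlinear principal component scores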
Y. Freund (AT&T Research):
AdaBoost as a procedure for margin maximization (Theory + Applications):
Naturally occurring norms in AdaBoost (the best-performing boosting
algorithm known to date) are ℓ_∞ on the data x and ℓ_1 on the margin
vector w; for SVMs they are ℓ_2 and ℓ_2 (a numerical illustration follows
this entry).
Corresponding bounds imply that AdaBoost will beat SVMs when most features
are irrelevant, and vice versa.
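
A small numerical illustration of why the pairing of norms matters (a toy
example of my own, not from the talk): with a sparse target vector and dense
+/-1 features, the margin normalized by ||w||_1 * max_i ||x_i||_inf is far
larger than the one normalized by ||w||_2 * max_i ||x_i||_2, which is the
regime in which the AdaBoost-style bound wins:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 1000
    w = np.zeros(d); w[:5] = 1.0                 # sparse target: 5 relevant features
    X = rng.choice([-1.0, 1.0], size=(n, d))     # dense +/-1 features
    y = np.sign(X @ w)
    raw = y * (X @ w)                            # unnormalized margins y_i (w . x_i)
    margin_l1_linf = raw.min() / (np.abs(w).sum() * np.abs(X).max())
    margin_l2_l2 = raw.min() / (np.linalg.norm(w) * np.linalg.norm(X, axis=1).max())
    print(margin_l1_linf, margin_l2_l2)          # ~0.2 versus ~0.014 here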
K. Mueller (GMD FIRST, Berlin):
Predicting Time Series with SVMs (Applications):
SVMs give better regression results than other systems on noisy data. In
particular, they give equal or better performance than RBF networks (despite
using fixed kernel widths) on the Mackey-Glass time series, and much better
performance (by 29%) than all other published results on data set “D” of the
Santa Fe time series competition.
SVM classification is currently the best in the world for charmed quark detection.
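
A minimal sketch of the usual setup for SV regression on a time series
(delay embedding; the synthetic series, window length, and SVR parameters
are arbitrary assumptions of mine, not those used in the talk):

    import numpy as np
    from sklearn.svm import SVR

    def delay_embed(series, order):
        """Turn a scalar series into (window of past values -> next value) pairs."""
        X = np.array([series[t:t + order] for t in range(len(series) - order)])
        return X, series[order:]

    t = np.arange(2000)
    series = np.sin(0.3 * t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
    X, y = delay_embed(series, order=6)
    model = SVR(kernel="rbf", C=10.0, epsilon=0.05).fit(X[:1500], y[:1500])
    print(np.mean((model.predict(X[1500:]) - y[1500:]) ** 2))   # one-step-ahead test MSE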
T. Joachims (U. Dortmund):
SV machines for text categorization (Applications):
SVMs give much better results, and are much more robust, than the other
methods tried on a text categorization problem.
F. Girosi (CBCL, MIT):
SVMs, regularization theory, and sparse approximation (Theory):
A modified version of the Basis Pursuit De-Noising algorithm is equivalent
to an SVM. RKHS play a fundamental role in SVMs.
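
For reference, the two standard objectives involved (written from memory,
not quoted from the talk): Basis Pursuit De-Noising and SV regression solve,
respectively,

    \[
    \min_{a}\ \tfrac{1}{2}\Bigl\|y - \sum_i a_i g_i\Bigr\|_2^2 + \lambda\sum_i |a_i|
    \qquad\text{and}\qquad
    \min_{f \in \mathcal{H}_K}\ C\sum_i \bigl|y_i - f(x_i)\bigr|_\varepsilon
      + \tfrac{1}{2}\|f\|_{\mathcal{H}_K}^2 ,
    \]

where |u|_\varepsilon = max(0, |u| - \varepsilon) is the epsilon-insensitive
loss and the g_i are dictionary elements; the claim above is that a suitably
modified version of the first problem coincides with the second.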
K. Bennett (Rensselaer Polytechnic Institute):
Combining SV methods and mathematical programming methods for induction:
Mathematical programming (MP) methods have a rich history: MP and SVM use the
same canonical form, and are easily combined (the canonical form is written
out after this entry)
MP could itself benefit from learning theory
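
For illustration (my addition), the soft-margin SVM written as a standard
mathematical program, i.e. its canonical QP form:

    \[
    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{\ell}\xi_i
    \quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i,
    \ \ \xi_i \ge 0,\ \ i = 1, \dots, \ell .
    \]

Swapping the quadratic regularizer \|w\|_2^2 for \|w\|_1 turns this into a
linear program, which is one standard point of contact between the MP and
SVM literatures.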
M. Stitson (Royal Holloway London):
SV ANOVA decomposition (Theory + Applications):
ANOVA decomposition spline kernels give better results than regular
splines and polynomial kernels on the Boston Housing regression problem.
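
A minimal direct implementation of an order-d ANOVA kernel (my own
illustration; the simple product base kernel here would be replaced by a
univariate spline kernel in the setting of the talk):

    import numpy as np
    from itertools import combinations

    def anova_kernel(x, z, d, base=lambda a, b: a * b):
        """Order-d ANOVA kernel: sum over all d-subsets of coordinates of the
        product of a univariate base kernel on each chosen coordinate."""
        k = [base(x[i], z[i]) for i in range(len(x))]
        return sum(np.prod([k[i] for i in S]) for S in combinations(range(len(x)), d))

    x = np.array([1.0, 2.0, 3.0]); z = np.array([0.5, 1.0, 2.0])
    print(anova_kernel(x, z, d=2))   # 0.5*2 + 0.5*6 + 2*6 = 16.0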
J. Weston (Royal Holloway London):
Density estimation using SV machines (Theory):
Can modify the SVM algorithm to estimate a density (differences: this
results in an LP problem, and the kernels are not Mercer kernels)
Showed a natural way to use kernels of different widths.
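
A generic sketch of density estimation posed as an LP (my own toy
formulation to convey the flavor, not necessarily the formulation presented:
a nonnegative mixture of Gaussian kernels is fit to the empirical CDF in the
sup norm):

    import numpy as np
    from scipy.optimize import linprog
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    x = np.sort(rng.normal(size=100))
    n, h = len(x), 0.3
    F_emp = (np.arange(1, n + 1) - 0.5) / n              # empirical CDF at the sample points
    A = norm.cdf((x[:, None] - x[None, :]) / h)          # A[i, j] = kernel-j CDF at x_i

    # Variables: mixture weights beta (n of them) and the sup-norm error t; minimize t.
    c = np.concatenate([np.zeros(n), [1.0]])
    A_ub = np.block([[A, -np.ones((n, 1))],              #  A beta - t <= F_emp
                     [-A, -np.ones((n, 1))]])            # -A beta - t <= -F_emp
    b_ub = np.concatenate([F_emp, -F_emp])
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]  # weights sum to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0])  # default bounds: vars >= 0
    beta = res.x[:n]
    print(res.x[-1], int(np.sum(beta > 1e-6)))           # CDF fit error, kernels retained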
A. Smola (GMD FIRST, Berlin):
General cost functions for SV regression (Theory):
By using path-following interior point methods for SVM training, the class
of admissible cost functions can be extended
This is useful because the cost function should be matched to the noise
model, where possible, for best performance.
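
For reference (standard loss/noise correspondences, my addition): the
data-fit term c(\xi) should match the assumed noise density via
c = -log p, e.g.

    \[
    c(\xi) = \xi^2 \ \ (\text{Gaussian noise}), \qquad
    c(\xi) = |\xi| \ \ (\text{Laplacian noise}), \qquad
    c(\xi) = \max(0,\, |\xi| - \varepsilon) \ \ (\text{standard SV regression}),
    \]

with Huber's loss interpolating between the first two.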
J. Shawe-Taylor (Royal Holloway London):
Data-sensitive PAC analysis for SV and other machines (Theory):
PAC bounds can be used to put structural risk minimization (SRM) on a
rigorous, data-dependent footing for SV machines.
N. Cristianini (U. Bristol):
Bayesian voting schemes and large margin algorithms (Theory):
Bayesian learning algorithms can be regarded as large margin hyperplane
algorithms in high dimensional feature space, where the margin depends on the
prior.
D. Schuurmans (U. Pennsylvania & NEC Research):
Improving and generalizing the basic maximum margin algorithm (Theory):
Margins defined with respect to the ℓ_1 norm of the weight vector (as
opposed to the ℓ_2 norm) will be a win if the target vector is sparse,
e.g. text classification.
Can use unlabeled data, which gives a metric on the hypothesis space, to
improve generalization performance.