Course on Machine Learning with Kernels
(Slides: Day 1, Day 2)
Singapore, October 4-5, 2004 and Kuala Lumpur, October 6-7, 2004
Alex Smola
National ICT Australia, Machine Learning Program, Canberra Laboratory
Day 1
Lecture 1: Introduction to Machine Learning and Probability Theory
We introduce the concept of machine learning as it is used to solve
problems of pattern recognition, classification, regression, novelty
detection and data cleaning. Subsequently we give a primer on
probabilities, Bayes rule and inference (hypothesis testing for disease
diagnosis).
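To make the disease-diagnosis example concrete, here is a small worked instance of Bayes rule; the prevalence, sensitivity and false-positive numbers below are invented purely for illustration and are not taken from the slides:

    # Worked Bayes-rule example for disease diagnosis (all numbers are illustrative).
    p_disease = 0.01              # prior: prevalence of the disease
    p_pos_given_disease = 0.99    # sensitivity of the test
    p_pos_given_healthy = 0.05    # false-positive rate (1 - specificity)

    # total probability of a positive test result
    p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

    # Bayes rule: posterior probability of disease given a positive test
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(p_disease_given_pos)    # roughly 0.17: most positive tests are false alarms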
Lecture 2: Density Estimation and Parzen Windows
We begin with a simple density estimator, Parzen windows, which is very easy to implement because essentially no training algorithm needs to run before the estimator can be used. We give a simple rule for tuning its parameters, discuss Silverman's rule and the Watson-Nadaraya estimator for classification and regression, and cover cross-validation. Examples and applications conclude this lecture.
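As a flavour of what this lecture covers, the following is a minimal sketch of a one-dimensional Parzen window estimator with a Gaussian kernel and Silverman's rule of thumb for the bandwidth; the function name and the synthetic data are my own, not from the course material:

    import numpy as np

    def parzen_density(x, data, h):
        """Parzen window estimate of a 1D density at the points x,
        using a Gaussian kernel with bandwidth h."""
        # average of Gaussian bumps centred at the observations
        diffs = (x[:, None] - data[None, :]) / h
        return np.exp(-0.5 * diffs ** 2).sum(axis=1) / (len(data) * h * np.sqrt(2 * np.pi))

    # Silverman's rule of thumb for the bandwidth of a Gaussian kernel in 1D
    data = np.random.randn(200)
    h = 1.06 * data.std() * len(data) ** (-1 / 5)
    density = parzen_density(np.linspace(-4, 4, 50), data, h)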
Lecture 3: The Perceptron and Kernels
A slightly more complex classifier is the Perceptron, which produces a linear separation of sets. We explain the algorithm and show its properties and implementation details. Subsequently we modify the algorithm to allow for nonlinear separation and multiclass discrimination. This leads us naturally to introduce kernels. Examples of kernels are given (more details follow on Day 2 of the course).
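To illustrate the step from the linear Perceptron to its kernelized form, here is a rough sketch of a dual (kernelized) Perceptron with a Gaussian RBF kernel; the names, the choice of kernel and the stopping rule are assumptions made for illustration only:

    import numpy as np

    def rbf_kernel(x, y, gamma=1.0):
        # Gaussian RBF kernel between two input vectors
        return np.exp(-gamma * np.sum((x - y) ** 2))

    def kernel_perceptron(X, y, kernel=rbf_kernel, epochs=10):
        """Dual (kernelized) Perceptron; labels y must be +1/-1.
        Returns the dual coefficients alpha of the kernel expansion."""
        n = len(y)
        alpha = np.zeros(n)
        for _ in range(epochs):
            for i in range(n):
                # predict example i from the current expansion
                f = sum(alpha[j] * y[j] * kernel(X[j], X[i]) for j in range(n))
                if y[i] * f <= 0:      # mistake: add example i to the expansion
                    alpha[i] += 1
        return alpha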
Lecture 4: Support Vector Classification
Support Vector Machines are a more sophisticated method for solving the
classification problem. We describe the optimization problem they solve
and show their geometrical properties.
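For reference, the optimization problem in question is the standard soft-margin formulation, stated here in its textbook form for a linear classifier (the slides may use a slightly different parametrization):

\min_{w, b, \xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\ \xi_i \ge 0.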
Day 2
Lecture 1: Kernel Methods for Text Categorization and Biological
Sequence Analysis
We describe the problem of text categorization, explain kernels on texts
and biological sequences and show how they can be computed
efficiently. We give practical examples for Remote Homology Detection and
the Reuters database.
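One simple example of a kernel on sequences is the k-mer spectrum kernel, which counts shared substrings of length k; whether this exact variant is the one used in the lecture is an assumption, and the sketch below is for illustration only:

    from collections import Counter

    def spectrum_kernel(s, t, k=3):
        """k-mer spectrum kernel between two strings:
        the inner product of their length-k substring counts."""
        cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
        ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
        return sum(cs[w] * ct[w] for w in cs if w in ct)

    print(spectrum_kernel("MKVLAAGIV", "MKVLSAGIV", k=3))   # toy protein fragments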
Lecture 2: Optimization
If one wants to implement SVMs oneself (or simply to control them better), one needs to understand how the optimization problem is solved. After a
short primer on convex optimization methods we explain chunking and
Sequential Minimal Optimization. We conclude with more advanced methods
(yet easy to implement), such as online learning with kernels.
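As a taste of the online methods mentioned above, here is a NORMA-style sketch of stochastic gradient descent on the regularized hinge loss with a growing kernel expansion; the function name, parameter values and toy data are assumptions for illustration:

    import numpy as np

    def online_kernel_hinge(X, y, kernel, lam=0.01, eta=0.1):
        """Online learning with kernels: stochastic gradient descent on the
        regularized hinge loss, keeping a growing kernel expansion."""
        coef, support = [], []
        for x_t, y_t in zip(X, y):
            # prediction from the expansion built so far
            f = sum(c * kernel(s, x_t) for c, s in zip(coef, support))
            # regularization shrinks all existing coefficients
            coef = [(1 - eta * lam) * c for c in coef]
            if y_t * f < 1:            # hinge loss active: add a new expansion term
                coef.append(eta * y_t)
                support.append(x_t)
        return coef, support

    # toy usage with a linear kernel (illustration only)
    X = np.random.randn(100, 2)
    y = np.sign(X[:, 0] + X[:, 1])
    coef, support = online_kernel_hinge(X, y, kernel=lambda a, b: a @ b)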
Lecture 3: Regression and Novelty Detection
We continue with a description of methods for regression with kernels,
namely the classical SV regression and regularized least mean squares
regression. Implementation details are given. Subsequently we discuss
kernel methods for novelty detection and database cleaning.
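A minimal sketch of regularized least squares regression in its dual (kernel) form follows; the scaling of the regularizer and the toy kernel matrix are conventions chosen for illustration, not necessarily those used in the lecture:

    import numpy as np

    def kernel_ridge_fit(K, y, lam=0.1):
        """Regularized least squares regression in the dual: solve
        (K + lam * n * I) alpha = y for the expansion coefficients alpha."""
        n = len(y)
        return np.linalg.solve(K + lam * n * np.eye(n), y)

    # toy usage with a Gaussian kernel matrix on 1D inputs
    X = np.random.randn(50, 1)
    y = np.sin(X[:, 0])
    K = np.exp(-0.5 * (X - X.T) ** 2)   # 50 x 50 kernel matrix
    alpha = kernel_ridge_fit(K, y)
    y_hat = K @ alpha                   # fitted values are a kernel expansion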
Lecture 4: How to Get Good Results in Practice
Clearly one of the key issues is how to obtain good results in
practice. The course concludes with a bag of important practical tricks,
such as the nu-trick for adjusting the regularization parameter, the
median trick for adjusting the kernel, how to use cross-validation in
practice, how to scale the data before optimization, and how to
interpret more advanced issues such as the spectrum of the kernel matrix
and the smoothness of the kernel itself.
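To make one of these tricks concrete, here is a sketch of the median trick for setting the bandwidth of a Gaussian kernel: take the kernel width to be the median pairwise distance in the sample. The exact scaling convention is an assumption:

    import numpy as np

    def median_trick_gamma(X):
        """Median heuristic: set the width of the Gaussian kernel
        k(x, x') = exp(-gamma * ||x - x'||^2) from the data by taking
        sigma = median pairwise distance and gamma = 1 / (2 * sigma**2)."""
        dists = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        sigma = np.median(dists[np.triu_indices(len(X), k=1)])
        return 1.0 / (2 * sigma ** 2)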
Prerequisites
Nothing beyond undergraduate knowledge in mathematics is expected. More specifically, I assume:
- Basic linear algebra (matrix inverse, eigenvector, eigenvalue, etc.)
- Some numerical mathematics (beneficial but not required), such as matrix factorization, conditioning, etc.
- Basic statistics and probability theory (Normal distribution, conditional distributions).
- (Optional) Some knowledge of Bayesian methods
- (Optional) Some knowledge of kernel methods