# Courses

## Practical Machine Learning CS329P 2021, Stanford

Alex Smola, Qingqing Huang, and Mu Li

Applying Machine Learning (ML) to solve real problems accurately and robustly requires more than just training the latest ML model. First, you will learn practical techniques to deal with data. This matters since real data is often not independently and identically distributed. These techniques include detecting covariate, concept, and label shifts, and modeling dependent random variables such as the ones in time series and graphs. Next, you will learn how to train ML models efficiently, including hyper-parameter tuning, model combination, and transfer learning. Last, you will learn about fairness and model explainability, and how to deploy models efficiently. This is a full-semester class that teaches statistics, algorithms, and code implementations. Homework assignments and the final project emphasize solving real problems.

## Introduction to Deep Learning STAT 157 2019, UC Berkeley

Alex Smola and Mu Li

This is a full-semester course providing a practical introduction to deep learning, including theoretical motivations and how to implement it in practice. As part of the course we will cover multilayer perceptrons, backpropagation, automatic differentiation, and stochastic gradient descent. Moreover, we introduce convolutional networks for image processing, starting from the simple LeNet to more recent architectures such as ResNet for highly accurate models. We then discuss sequence models and recurrent networks such as LSTMs and GRUs, as well as the attention mechanism. Throughout the course we emphasize efficient implementation, optimization and scalability, e.g. to multiple GPUs and to multiple machines. The goal of the course is to provide both a good understanding of modern nonparametric estimators and the ability to build them. The entire course is based on Jupyter notebooks to allow you to gain experience quickly.

## Introduction to Machine Learning 10-701 2015, CMU

Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you.

This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics and algorithms will be at an advantage, but the class has been designed so that anyone with a strong numerate background can catch up and fully participate.

## Advanced Optimization and Randomized Methods 10-801 2014, CMU

Alex Smola and Suvrit Sra

This course will cover a variety of topics from optimization (convex, nonconvex, continuous and combinatorial) as well as streaming algorithms. The key aim of the course is to make the students aware of powerful algorithmic tools that are used for tackling large-scale, data-intensive problems. The topics covered are chosen to give the students a solid footing for research in machine learning and optimization, while strengthening their practical grasp. In addition to foundational classical theory, the course will also cover some cutting-edge material due to the rapidly evolving nature of large-scale optimization.

## Introduction to Machine Learning 10-701x 2013, CMU

Alex Smola and Barnabas Poczos

Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you.

This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics and algorithms will be at an advantage, but the class has been designed so that anyone with a strong numerate background can catch up and fully participate. If you are interested in this topic, but are not a PhD student, or are a PhD student not specializing in machine learning, you might consider Roni Rosenfeld’s master’s level course on Machine Learning, 10-601.

## Introduction to Machine Learning 10-701 2013, CMU

Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you.

This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics and algorithms will be at an advantage, but the class has been designed so that anyone with a strong numerate background can catch up and fully participate.

## Scalable Machine Learning CS 281B 2012, UC Berkeley

Scalable Machine Learning occurs when Statistics, Systems, Machine Learning and Data Mining are combined into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach methods which are going to power the next generation of internet applications. The class will cover systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real time analytics, spam filtering, topic models, and document analysis.

## Machine Learning with Exponential Families 2004, Australian National University

Stephane Canu, Vishy Vishwanathan and Alex Smola

Machine learning is an exciting new subject dealing with automatic recognition of patterns (e.g. automatic recognition of faces, postal codes on envelopes, speech recognition), prediction (e.g. predicting the stock market), data mining (e.g. finding good customers, fraudulent transactions), applications in bioinformatics (finding relevant genes, annotating the genome, processing DNA Microarray images), or internet-related problems (spam filtering, searching, sorting, network security). It is becoming a key area for technical advance and people skilled in this area are in high demand worldwide.

This unit introduces the fundamentals of machine learning, based on the unifying framework of exponential families. The course requires mathematical and computer skills. It will cover linear algebra and numerical mathematics techniques, linear classification, regression, mathematical programming, the Perceptron, Graphical Models, Boosting, density estimation, conditional random fields, Regression methods, Kernels and Regularization.

## Introduction to Machine Learning, SISE9128 200, Australian National University

This course is very similar to ENGN4520.

## Introduction to Machine Learning, ENGN4520 2001, Australian National University

Machine learning is an exciting new subject dealing with automatic recognition of patterns (e.g. automatic recognition of faces, postal codes on envelopes, speech recognition), prediction (e.g. predicting the stock market), data mining (e.g. finding good customers, fraudulent transactions), applications in bioinformatics (finding relevant genes, annotating the genome, processing DNA Microarray images), or internet-related problems (spam filtering, searching, sorting, network security). It is becoming a key area for technical advance and people skilled in this area are in high demand worldwide.

# Tutorials

## Dive into Deep Learning in 1 Day, Open Data Science Conference 2019

Did you ever want to find out about deep learning but didn’t have time to spend months? New to machine learning? Do you want to build image classifiers, NLP apps, train on many GPUs or even on many machines? If you’re an engineer or data scientist, this course is for you. This is about the equivalent of a Coursera course, all packed into one day.

## Attention in Deep Learning

Attention is a key mechanism to enable nonparametric models in deep learning. Quite arguably it is the basis of most recent progress in deep learning models. Beyond its introduction in neural machine translation, it can be traced back to neuroscience. It was arguably introduced via the gating or forgetting mechanism of LSTMs. Over the past 5 years attention has been key to advancing the state of the art in areas as diverse as natural language processing, computer vision, speech recognition, image synthesis, solving traveling salesman problems, or reinforcement learning. This tutorial offers a coherent overview of various types of attention; efficient implementation using Jupyter notebooks which allow the audience a hands-on experience to replicate and apply attention mechanisms; and a textbook (www.d2l.ai) to allow the audience to dive more deeply into the underlying theory. Slides can be found in Keynote and PDF.

## NLP with Apache MXNet

Leonard Lausen, Haibin Lin and Alex Smola

While deep learning has rapidly emerged as the dominant approach to training predictive models for large-scale machine learning problems, these algorithms push the limits of available hardware, requiring specialized frameworks optimized for GPUs and distributed cloud-based training. Moreover, especially in natural language processing (NLP), models contain a variety of moving parts: character-based encoders, pre-trained word embeddings, long short-term memory (LSTM) cells, and beam search for decoding sequential outputs, among others.

This tutorial introduces GluonNLP, a powerful new toolkit that combines MXNet’s speed, the user-friendly Gluon frontend, and an extensive new library automating the most painful aspects of deep learning for NLP. In this full-day tutorial, we will start off with a crash course on deep learning with Gluon, covering data, autodiff, and deep (convolutional and recurrent) neural networks. Then we’ll dive into GluonNLP, demonstrating how to work with word embeddings (both pre-trained and from scratch), language models, and the popular Transformer model for machine translation.

## Scaling Machine Learning, AAAI 2014

Amr Ahmed and Alex Smola

This tutorial discusses machine learning and systems aspects for big data analytics. In particular, we will give an overview of modern distributed processing systems and sources of large amounts of data. We will discuss the parameter server concept in considerable detail and then show how it can be applied to solve a variety of large scale learning problems, ranging from terascale convex optimization and topic modeling on billions of users to generative models for mixed datatypes such as geotagged microblogs, and factorization and recommender systems.

## Machine Learning for Computational Advertising, UC Santa Cruz 2009

We cover basic probability theory, instance based learning and the perceptron.

## Introduction to Machine Learning, Tata TRDC 2007

This tutorial covers basic ML, density estimation, perceptron, SVMs, regression and structured estimation.

## Machine Learning with Kernels, ICONIP 2006

We introduce exponential families and show how they can be used for modelling a large range of distributions important for supervised learning. In particular we will discuss multinomial and Gaussian families. Moreover, we show how optimization problems are solved in the case of normal priors. Finally, we discuss connections to graphical models and message passing.

By conditioning on location we extend exponential family models into state of the art multiclass classification and regression estimators. In addition, we will discuss conditional random fields, which are used for document annotation and named entity tagging.

Operator methods are useful to test for identity between distributions. We will discuss a very simple and easily implementable criterion for such tests. Applications to data integration are discussed. We also discuss applications to covariate shift correction, that is, cases where training and test set are drawn from different distributions. Lastly we extend this to dependency estimation.

## Machine Learning with Kernels, Singapore & Malaysia 2004

This is a two-day course on kernels.

## Bayesian Kernel Methods, MLSS 2002

The course begins with an overview of the basic assumptions underlying Bayesian estimation. We explain the notion of prior distributions, which encode our prior belief concerning the likelihood of obtaining a certain estimate, and the concept of the posterior probability, which quantifies how plausible functions appear after we observe some data. Subsequently we show how inference is performed, and how certain numerical problems that arise can be alleviated by various types of Maximum-a-Posteriori (MAP) estimation.

Once the basic tools are introduced, we analyze the specific properties of Bayesian estimators for three different types of prior probabilities: Gaussian Processes, which rely on the assumption that adjacent coefficients are correlated (this part includes a description of the theory and efficient means of implementation); Laplacian Processes, which assume that estimates can be expanded into a sparse linear combination of kernel functions and therefore favor such hypotheses; and Relevance Vector Machines, which assume that the contribution of each kernel function is governed by a normal distribution with its own variance.

## Tutorial on Bayesian Kernel Methods, ICML 2002

The tutorial will introduce Gaussian Processes both for Classification and Regression. This includes a brief presentation of covariance functions, their connection to Support Vector Kernels, and an overview of recent optimization methods for Gaussian Processes.

Target Audience: Novices and researchers more advanced in the knowledge of Gaussian Processes will benefit from the presentation. While being self contained, i.e., without requiring much further knowledge than basic calculus and linear algebra, the presentation will advance to state of the art results in optimization and adaptive inference. This means that the course will cater for Graduate Students and senior researchers alike. In particular, I will not assume knowledge beyond undergraduate mathematics.

## Tutorial on Support Vector Machines, ISCAS 2001

Support Vector Machines and related Bayesian kernel methods such as Gaussian Processes or the Relevance Vector Machines have been deployed successfully in classification and regression tasks. They work by mapping the data into a high-dimensional feature space and computing linear functions on the features. This has the appeal of being easily accessible to optimization and theoretical analysis.

The algorithmic advantage is that the optimization problems resulting from Support Vector Machines have a global minimum and that they can be solved with standard quadratic programming tools. Furthermore, the parametrization of kernel methods tends to be rather intuitive for the user. In this tutorial, I will introduce the basic theory of Support Vector Machines and some recent extensions. Moreover, I will present a few simple algorithms to solve the optimization problems in practice.
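As a small illustration of the feature-space view described above (a sketch for this page, not material taken from the tutorial), the following checks numerically that a homogeneous quadratic kernel on 2-D inputs, evaluated directly in input space, equals an ordinary inner product after an explicit feature map. The function names `dot`, `feature_map`, and `quadratic_kernel` are hypothetical and chosen for illustration only.

```python
import math

def dot(u, v):
    """Plain Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def feature_map(x):
    """Explicit feature map for the homogeneous quadratic kernel in 2-D.

    Maps (x1, x2) to (x1^2, sqrt(2) * x1 * x2, x2^2), so that
    dot(feature_map(x), feature_map(y)) == (x . y)^2.
    """
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, y):
    """Kernel evaluated directly in input space: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2

# Illustrative inputs (assumed values, not from the source).
x, y = (1.0, 2.0), (3.0, -1.0)
lhs = quadratic_kernel(x, y)               # computed without the feature map
rhs = dot(feature_map(x), feature_map(y))  # computed in feature space
print(math.isclose(lhs, rhs))
```

The kernel evaluation never constructs the feature vectors, which is what makes linear methods in very high-dimensional feature spaces computationally practical.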

## Tutorial on Support Vector Machines, ICANN 2001

This tutorial is very similar to the ISCAS 2001 tutorial.

# Workshops

## NIPS 2015: Nonparametric Methods for Large Scale Representation Learning

This one-day workshop is about nonparametric methods for large scale structure learning, including automatic pattern discovery, extrapolation, manifold learning, kernel learning, metric learning, data compression, feature extraction, trend filtering, and dimensionality reduction. Nonparametric methods include, for example, Gaussian processes, Dirichlet processes, Indian buffet processes, and support vector machines. We are particularly interested in developing scalable and expressive methods to derive new scientific insights from large datasets. A poster session, coffee breaks, and a panel-guided discussion will encourage interaction between attendees. This workshop aims to bring together researchers wishing to explore alternatives to neural networks for learning rich non-linear function classes, with an emphasis on nonparametric methods, representation learning and scalability. We wish to carefully review and enumerate modern approaches to these challenges, share insights into the underlying properties of these methods, and discuss future directions.

## NIPS 2013: Modern Nonparametric Methods in Machine Learning

Arthur Gretton, Mladen Kolar, Samory Kpotufe, John Lafferty, Han Liu, Bernhard Schölkopf, Alex Smola, Rob Nowak, Mikhail Belkin, Lorenzo Rosasco, Peter Bickel, Yue Zhao

Modern data acquisition routinely produces massive and complex datasets. Examples are data from high throughput genomic experiments, climate data from worldwide data centers, robotic control data collected over time in adversarial settings, user-behavior data from social networks, user preferences on online markets, and so forth. Modern pattern recognition problems arising in such disciplines are characterized by large data sizes, large numbers of observed variables, and increased pattern complexity. Therefore, nonparametric methods which can handle generally complex patterns are ever more relevant for modern data analysis. However, the larger data sizes and number of variables constitute new challenges for nonparametric methods in general. The aim of this workshop is to bring together both theoretical and applied researchers to discuss these modern challenges in detail, share insight on existing solutions, and lay out some of the important future directions.

Through a number of invited and contributed talks and a focused panel discussion, we plan to emphasize the importance of nonparametric methods and present challenges for modern nonparametric methods.

## NIPS 2013: Randomized Methods for Machine Learning

As we enter the era of big data, Machine Learning algorithms that resort to heavy optimization routines rapidly become prohibitive. Perhaps surprisingly, randomization (Raghavan and Motwani, 1995) arises as a computationally cheaper, simpler alternative to optimization that in many cases leads to smaller and faster models with little or no loss in performance. Although randomized algorithms date back to the probabilistic method (Erdős, 1947, Alon & Spencer, 2000), these techniques only recently started finding their way into Machine Learning. The most notable exceptions are stochastic methods for optimization and Markov Chain Monte Carlo methods, both of which have become well-established in the past two decades. This workshop aims to accelerate this process by bringing together researchers in this area and exposing them to recent developments. The target audience is researchers and practitioners looking for scalable, compact and fast solutions to learn in the large-scale setting.

## NIPS 2012: Confluence between Kernel Methods and Graphical Models

Kernel methods and graphical models are two important families of techniques for machine learning. Our community has witnessed many major but separate advances in the theory and applications of both subfields. For kernel methods, the advances include kernels on structured data, Hilbert-space embeddings of distributions, and applications of kernel methods to multiple kernel learning, transfer learning, and multi-task learning. For graphical models, the advances include variational inference, nonparametric Bayes techniques, and applications of graphical models to topic modeling, computational biology and social network problems.

This workshop addresses two main research questions: first, how may kernel methods be used to address difficult learning problems for graphical models, such as inference for multi-modal continuous distributions on many variables, and dealing with non-conjugate priors? And second, how might kernel methods be advanced by bringing in concepts from graphical models, for instance by incorporating sophisticated conditional independence structures, latent variables, and prior information?

Kernel algorithms have traditionally had the advantage of being solved via convex optimization or eigenproblems, and having strong statistical guarantees on convergence. The graphical model literature has focused on modelling complex dependence structures in a flexible way, although approximations may be required to make inference tractable. Can we develop a new set of methods which blend these strengths?

There have recently been a number of publications combining kernel and graphical model techniques, including kernel hidden Markov models, kernel belief propagation, kernel Bayes rule, kernel topic models, kernel variational inference, kernel herding as Bayesian quadrature, kernel beta processes, and a connection between kernel k-means and Bayesian nonparametrics. Each of these results deals with different inference tasks, and makes use of a range of RKHS properties. We propose this workshop so as to “connect the dots” and develop a unified toolkit to address a broad range of learning problems, to the mutual benefit of researchers in kernels and graphical models. The goals of the workshop are thus twofold: first, to provide an accessible review and synthesis of recent results combining graphical models and kernels. Second, to provide a discussion forum for open problems and technical challenges.

## NIPS 2011: Algorithms, Systems, and Tools for Learning at Scale

This workshop will address tools, algorithms, systems, hardware, and real-world problem domains related to large-scale machine learning (“Big Learning”). The Big Learning setting has attracted intense interest with active research spanning diverse fields including machine learning, databases, parallel and distributed systems, parallel architectures, and programming languages and abstractions. This workshop will bring together experts across these diverse communities to discuss recent progress, share tools and software, identify pressing new challenges, and to exchange new ideas. Topics of interest include (but are not limited to):

- **Hardware Accelerated Learning:** Practicality and performance of specialized high-performance hardware (e.g. GPUs, FPGAs, ASIC) for machine learning applications.
- **Applications of Big Learning:** Practical application case studies; insights on end-users, typical data workflow patterns, common data characteristics (stream or batch); trade-offs between labeling strategies (e.g., curated or crowd-sourced); challenges of real-world system building.
- **Tools, Software, & Systems:** Languages and libraries for large-scale parallel or distributed learning. Preference will be given to approaches and systems that leverage cloud computing (e.g. Hadoop, DryadLINQ, EC2, Azure), scalable storage (e.g. RDBMs, NoSQL, graph databases), and/or specialized hardware (e.g. GPU, Multicore, FPGA, ASIC).
- **Models & Algorithms:** Applicability of different learning techniques in different situations (e.g., simple statistics vs. large structured models); parallel acceleration of computationally intensive learning and inference; evaluation methodology; trade-offs between performance and engineering complexity; principled methods for dealing with large number of features.

*Unfortunately the website is no longer live.*

## NIPS 2010: Challenges of Data Visualization

The increasing amount and complexity of electronic data sets turns visualization into a key technology to provide an intuitive interface to the information. Unsupervised learning has developed powerful techniques for, e.g., manifold learning, dimensionality reduction, collaborative filtering, and topic modeling. However, the field has so far not fully appreciated the problems that data analysts seeking to apply unsupervised learning to information visualization are facing such as heterogeneous and context dependent objectives or streaming and distributed data with different credibility. Moreover, the unsupervised learning field has hitherto failed to develop human-in-the-loop approaches to data visualization, even though such approaches including e.g. user relevance feedback are necessary to arrive at valid and interesting results.

As a consequence, a number of challenges arise in the context of data visualization which cannot be solved by classical methods in the field:

Methods have to deal with modern data formats and data sets: How can the technologies be adapted to deal with streaming and probably non i.i.d. data sets? How can specific data formats be visualized appropriately such as spatio-temporal data, spectral data, data characterized by a general probably non-metric dissimilarity measure, etc.? How can we deal with heterogeneous data and different credibility? How can the dissimilarity measure be adapted to emphasize the aspects which are relevant for visualization?

Available techniques for specific tasks should be combined in a canonic way: How can unsupervised learning techniques be combined to construct good visualizations? For instance, how can we effectively combine techniques for clustering, collaborative filtering, and topic modeling with dimensionality reduction to construct scatter plots that reveal the similarity between groups of data, movies, or documents? How can we arrive at context dependent visualization?

Visualization techniques should be accompanied by theoretical guarantees: What are reasonable mathematical specifications of data visualization to shape this inherently ill-posed problem? Can this be controlled by the user in an efficient way? How can visualization be evaluated? What are reasonable benchmarks? What are reasonable evaluation measures?

Visualization techniques should be ready to use for users outside the field: Which methods are suited to users outside the field? How can the need to set specific technical parameters by hand, or to choose among different possible mathematical algorithms by hand, be avoided? Can this need be replaced by intuitive interactive mechanisms which can be used by non-experts?

## SIGIR 2010: Feature Generation and Selection for Information Retrieval

- Evgeniy Gabrilovich, Yahoo! Research
- Alex Smola, Yahoo! Research
- Tali Tishby, Hebrew University

Modern information retrieval systems facilitate information access at unprecedented scale and level of sophistication. However, in many cases the underlying representation of text remains quite simple, often limited to using a weighted bag of words. Over the years, several approaches to automatic feature generation have been proposed (such as Latent Semantic Indexing, Hashing, or Latent Dirichlet Allocation), yet their application in large scale systems still remains the exception rather than the rule. On the other hand, numerous studies in NLP and IR resort to manually crafting features, which is a laborious and often computationally expensive process. Such studies often focus on one specific problem, and consequently many features they define are task- or domain-dependent. Consequently little knowledge transfer is possible to other problem domains. This limits our understanding of how to reliably construct informative features for new tasks.

## NIPS 2009: Large-Scale Machine Learning

- Carlos Guestrin, CMU
- Alex Gray, Georgia Tech
- Alex Smola, Yahoo! Research
- Arthur Gretton, CMU
- Joseph Gonzalez, CMU

Physical and economic limitations have forced computer architecture towards parallelism and away from exponential frequency scaling. Meanwhile, increased access to ubiquitous sensing and the web has resulted in an explosion in the size of machine learning tasks. In order to benefit from current and future trends in processor technology we must discover, understand, and exploit the available parallelism in machine learning. This workshop will achieve four key goals:

- Bring together people with varying approaches to parallelism in machine learning to identify tools, techniques, and algorithmic ideas which have led to successful parallel learning.
- Invite researchers from related fields, including parallel algorithms, computer architecture, scientific computing, and distributed systems, who will provide new perspectives to the NIPS community on these problems, and may also benefit from future collaborations with the NIPS audience.
- Identify the next key challenges and opportunities for parallel learning.
- Discuss large-scale applications, e.g., those with real time demands, that might benefit from parallel learning.

## NIPS 2007: Representations and Inference on Probability Distributions

- Kenji Fukumizu, Institute of Statistical Mathematics
- Alex Smola, Australian National University
- Arthur Gretton, CMU

When dealing with distributions it is in general infeasible to estimate them explicitly in high dimensional settings, since the associated learning rates can be arbitrarily slow. On the other hand, a great variety of applications in machine learning and computer science require distribution estimation and/or comparison. Examples include testing for homogeneity (the “two-sample problem”), independence, and conditional independence, where the last two can be used to infer causality; data set squashing / data sketching / data anonymisation; domain adaptation (the transfer of knowledge learned on one domain to solving problems on another, related domain) and the related problem of covariate shift; message passing in graphical models (EP and related algorithms); compressed sensing; and links between divergence measures and loss functions.

## NIPS 2004: Graphical Models and Kernels

- Alex Smola, Australian National University
- Ben Taskar, U Pennsylvania
- Vishy Vishwanathan, Purdue University

Graphical models provide a natural method to model variables with structured conditional independence properties. They allow for understandable description of models, making them a popular tool in practice. Kernel methods excel at modeling data which need not be structured at all, by using mappings into high-dimensional spaces (also popularly called the kernel trick). The popularity of kernel methods is primarily due to their strong theoretical foundations and the relatively simple convex optimization problems.

Recent progress towards a unification of the two areas has seen work on Maximum Margin Markov Networks, structured output spaces, and kernelized Conditional Random Fields. Some work has also been done on using fundamental properties of the exponential family of probability distributions to establish links.

The aim of this workshop is to bring together researchers from both communities in order to facilitate interactions. More specifically, the issues we want to address include (but are not limited to) the fundamental theory linking these fields. We want to investigate connections using exponential families, conditional random fields, Markov models, etc. We also wish to explore the applications of the kernel trick to graphical models and study the optimization problems that arise out of such a marriage. Uniform-convergence-type results for theoretically bounding the performance of such models will also be discussed.

## NIPS 2002: Unreal Data — Principles of Modeling Nonvectorial Data

- Zoubin Ghahramani, Cambridge University
- Gunnar Rätsch, Friedrich Miescher Laboratory
- Alex Smola, Australian National University

A large amount of research in machine learning is concerned with classification and regression for real-valued data which can easily be embedded into a Euclidean vector space. This is in stark contrast with many real world problems, where the data is often a highly structured combination of features, a sequence of symbols, a mixture of different modalities, may have missing variables, etc. To address the problem of learning from non-vectorial data, various methods have been proposed, such as embedding the structures in some metric spaces, the extraction and selection of features, proximity based approaches, parameter constraints in Graphical Models, Inductive Logic Programming, Decision Trees, etc. The goal of this workshop is twofold. Firstly, we hope to make the machine learning community aware of the problems arising from domains where non-vectorspace data abounds and to uncover the pitfalls of mapping such data into vector spaces. Secondly, we will try to find a more uniform structure governing methods for dealing with non-vectorial data and to understand what, if any, are the principles underlying the modeling of non-vectorial data.

## ICANN 1999: Gaussian Processes and Support Vector Machines

- Carl Rasmussen, Cambridge University
- Roderick Murray-Smith, University of Glasgow
- Alex Smola, Yahoo! Research
- Chris Williams, University of Edinburgh

This workshop aims to bring together people working with Gaussian Process (GP) and Support Vector Machine (SVM) predictors for regression and classification problems. We will open with tutorial-like introductions to the basics so that researchers new to the area can gain an impression of the applicability of the approaches, and will follow with contributed presentations. The final part of the workshop will be an open discussion session. We will bring laptops to provide some software demos, and encourage others to do the same.

## EUROCOLT 1999: Kernel Methods

- John Shawe-Taylor, University College London
- Bernhard Schoelkopf, MPI Tuebingen
- Alex Smola, Yahoo! Research
- Bob Williamson, NICTA and ANU

We are hosting a one-day informal workshop at Nordkirchen Castle, Germany, on Sunday 28th March, the day before the EuroCOLT’99 conference. The organisers are particularly interested in the analysis of kernels and regularization, and this will be one of the themes of the workshop.

The aim is to provide a meeting venue for those who are attending both the Dagstuhl meeting on unsupervised learning, ending on the 26th, and the EuroCOLT conference, starting on the 29th. Those not attending the Dagstuhl meeting are of course very welcome to participate, too. If you wish to attend, consider arriving on the Saturday evening when there will be a meeting to arrange the format of the day.

## NIPS 1998: Large Margin Classifiers

- Peter Bartlett, UC Berkeley
- Dale Schuurmans, U Alberta
- Bernhard Schoelkopf, MPI Tuebingen
- Alex Smola, Yahoo! Research

Many pattern classifiers are represented as thresholded real-valued functions, e.g. sigmoid neural networks, support vector machines, voting classifiers, and Bayesian schemes. There is currently a great deal of interest in algorithms that produce classifiers of this kind with large margins, where the margin is the amount by which the classifier’s prediction is on the correct side of the threshold. Recent theoretical and experimental results show that many learning algorithms (such as back-propagation, SVM methods, AdaBoost, and bagging) frequently produce classifiers with large margins, and that this leads to better generalization performance. Hence there is sufficient reason to believe that Large Margin Classifiers will become a core method of the standard machine learning toolbox.
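The margin described above can be sketched in a few lines. This is a minimal illustration, assuming binary labels in {-1, +1} and a real-valued score that is thresholded to obtain the class. The function name `margins` and the toy scores are hypothetical and not taken from the workshop.

```python
def margins(scores, labels, threshold=0.0):
    """Signed margin y * (f(x) - threshold) for each example.

    Positive values mean the score lies on the correct side of the
    threshold; the magnitude says by how much.
    """
    return [y * (s - threshold) for s, y in zip(scores, labels)]

# Toy real-valued classifier outputs f(x) and true labels y in {-1, +1}.
scores = [2.0, -0.5, 0.3, -1.5]
labels = [1, 1, -1, -1]

m = margins(scores, labels)
min_margin = min(m)                                   # worst-case margin
n_correct = sum(1 for value in m if value > 0)        # correctly classified
print(m, min_margin, n_correct)
```

A negative margin corresponds to a misclassified example, so the smallest margin over a sample summarizes how confidently the thresholded function separates it.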

## NIPS 1997: Support Vector Machines

- Leon Bottou, NEC Research
- Chris Burges, Microsoft Research
- Bernhard Schoelkopf, MPI Tuebingen
- Alex Smola, Yahoo! Research

The Support Vector (SV) learning algorithm (Boser, Guyon, Vapnik, 1992; Cortes, Vapnik, 1995; Vapnik, 1995) provides a general method for solving Pattern Recognition, Regression Estimation and Operator Inversion problems. The method is based on results in the theory of learning with finite sample sizes. The last few years have witnessed an increasing interest in SV machines, due largely to excellent results in pattern recognition, regression estimation and time series prediction experiments. The purpose of this workshop is (1) to provide an overview of recent developments in SV machines, ranging from theoretical results to applications, (2) to explore connections with other methods, and (3) to identify weaknesses, strengths and directions for future research for SVMs. We invite contributions on SV machines and related approaches, looking for empirical support wherever possible.