Syllabus
Introduction to Machine Learning - 10-701/15-781
## Overview
Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you. This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics, and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics, and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics, and algorithms will be at an advantage, but the class has been designed so that anyone with a strong quantitative background can catch up and fully participate. If you are interested in this topic but are not a PhD student, or are a PhD student not specializing in machine learning, you might consider Roni Rosenfeld’s master’s-level course on Machine Learning, 10-601.

## Syllabus

### 1. Introduction to Machine Learning
Applications
- Optical character recognition
- Bioinformatics
- Computational advertising
- Self-driving cars
- Network security
Programming with data
- Classification, Regression, Annotation
- Finding unusual events
Data
- Supervised
- Unsupervised
- Semi-supervised and transductive inference
Basic tools
- Linear classification, regression
- Feature maps
- Trees
- Instance based classifiers
Challenges
- Model selection, underfitting, overfitting
- Validation, confidence
- Explore/exploit tradeoff, reactive environments
### 2. Probability and Parameter Estimation
Probabilities
- Dependence, independence, conditional probabilities
- Bayes rule and Chain rule
- Paradoxes in measure theory
Parameter estimation
- Maximum Likelihood Estimation (MLE)
- Maximum a Posteriori Estimation (MAP)
Application I - Naive Bayes for spam filtering
- Discrete features in Naive Bayes
- Estimating parameters
- Finite sample size problems
Application II - Naive Bayes for fMRI data processing
- Continuous features in Naive Bayes
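To make the estimation ideas above concrete, here is a minimal sketch of a discrete Naive Bayes spam filter in Python. The toy documents, labels, and Laplace smoothing constant are invented for illustration; this is a sketch of the standard algorithm, not the course’s reference code.

```python
import numpy as np

# Toy data: rows are documents, columns are binary word-occurrence features.
# X, y, and the implied vocabulary are made up for illustration.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

def fit_naive_bayes(X, y, alpha=1.0):
    """MAP estimates with Laplace smoothing (a Beta prior on each feature).

    alpha=0 recovers the maximum likelihood estimate, which can assign
    zero probability to unseen words (the finite sample size problem above).
    """
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # P(word_j = 1 | class c), smoothed so no probability is exactly 0 or 1.
    likelihoods = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return classes, priors, likelihoods

def predict(X, classes, priors, likelihoods):
    # Work in log space to avoid underflow on long documents.
    log_post = (np.log(priors) + X @ np.log(likelihoods).T
                + (1 - X) @ np.log(1 - likelihoods).T)
    return classes[np.argmax(log_post, axis=1)]

classes, priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([[1, 1, 0]]), classes, priors, likelihoods))  # -> [1]
```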
### 3. Instance Based Learning
Parzen windows
- Basic idea (smoothing over empirical average)
- Kernels
Model selection
- Overfitting and underfitting
- Cross-validation and leave-one-out estimation
- Bias-variance tradeoff
- Curse of dimensionality
Nadaraya-Watson estimators
- Regression
- Classification
Nearest neighbor estimator
- Limit case via Parzen
- Fast lookup
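A minimal sketch of the Nadaraya-Watson estimator for regression, smoothing an empirical average with a Gaussian Parzen window. The data and bandwidth below are invented; in practice the bandwidth would be chosen by cross-validation as listed above.

```python
import numpy as np

def nadaraya_watson(x_query, X, y, bandwidth=0.5):
    """Nadaraya-Watson kernel regression with a Gaussian (Parzen) kernel.

    The prediction is a locally weighted average of the training targets;
    the bandwidth is the smoothing parameter that model selection tunes.
    """
    weights = np.exp(-0.5 * ((x_query[:, None] - X[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

# Hypothetical 1-d data: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 50))
y = np.sin(X) + 0.1 * rng.normal(size=50)
print(nadaraya_watson(np.array([np.pi / 2]), X, y))  # close to sin(pi/2) = 1
```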
### 4. Perceptron
Application - Hebbian learning
Perceptron
- Algorithm
- Convergence proof
- Properties
Kernel trick
- Basic idea
- Kernel Perceptron
- Kernel expansion
Kernel examples
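A sketch of the primal perceptron algorithm on an invented separable toy set (the kernel variant above replaces the inner products with kernel evaluations). By the convergence proof, the loop terminates after finitely many mistakes when the data are linearly separable.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Rosenblatt's perceptron: additively update (w, b) on every mistake.

    On linearly separable data the number of mistakes is bounded by
    (R / margin)^2, so the early-exit below is eventually taken.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # mistake: wrong side of hyperplane
                w += y_i * x_i             # move w toward the missed example
                b += y_i
                mistakes += 1
        if mistakes == 0:                  # data separated: stop early
            break
    return w, b

# Hypothetical separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```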
### 5. Support Vector Classification
Application - Optical Character Recognition
Support Vector Machines
- Large Margin Classification
- Optimization Problem
- Dual Problem
Properties
- Support Vectors
- Support Vector expansion
Soft Margin Classifier
- Noise tolerance
- Optimization problem
- Dual problem
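For reference, the soft-margin optimization problem and its dual in the usual form; the dual variables $\alpha_i$ are nonzero exactly on the support vectors, which gives the support vector expansion above.

```latex
% Soft-margin SVM primal:
\min_{w, b, \xi} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \;\; y_i \left(\langle w, x_i \rangle + b\right) \geq 1 - \xi_i,
\;\; \xi_i \geq 0
% and its dual:
\max_{\alpha} \;\; \sum_{i} \alpha_i
  - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
\quad \text{s.t.} \;\; 0 \leq \alpha_i \leq C, \;\; \sum_i \alpha_i y_i = 0.
```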
### 6. Kernels
Kernel properties
- Hilbert spaces
- Convex cone, distances, norms
- Mercer’s theorem
- Representer theorem
- Kernel expansion
- Kernel trick in general
Regularization and features
- Shotgun approach
- Regularization
- Explicit feature map
Hash kernels
- Application - spam filtering
- Hashing trick
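A sketch of the hashing trick from the items above: each token is mapped by a hash function to one of a fixed number of buckets, so no dictionary is needed and memory stays bounded. The hash function, signing scheme, and bucket count below are illustrative choices, not a specification.

```python
import hashlib

def hashed_features(tokens, num_buckets=2**10):
    """Hashing trick: build a fixed-size feature vector without a dictionary.

    The random sign keeps inner products unbiased in expectation despite
    hash collisions.
    """
    x = [0.0] * num_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % num_buckets                       # which bucket
        sign = 1.0 if (h // num_buckets) % 2 == 0 else -1.0
        x[idx] += sign
    return x

x = hashed_features("cheap pills buy now buy".split())
print(sum(v != 0 for v in x))  # number of occupied buckets
```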
### 7. Kernels
Application - jet engine failure detection
Support Vector Novelty Detection
- Density estimation
- Thresholding and scaling
- Optimization problem and dual
- Online setting
Support Vector Regression
- Loss functions
- Optimization problem and dual
- Examples
Kernel PCA
- Principal Component Analysis
- Kernelization
### 8. Convergence
Convergence properties
- Weak convergence
- Convergence in probability
- Strong (almost surely)
Guarantees
- Law of large numbers
### 9. Tail bounds
- Central limit theorem
- Tail bounds (Markov, Chebyshev, Hoeffding, Bernstein, and McDiarmid inequalities)
- Application - A/B testing for page layout
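As a worked instance for the A/B-testing application above, Hoeffding’s inequality controls how far an empirical click-through rate can stray from its mean, assuming i.i.d. clicks bounded in $[0, 1]$; the target accuracy and confidence below are illustrative choices.

```latex
% Hoeffding's inequality for i.i.d. X_i in [0, 1] with mean mu:
P\left(\left|\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right)
  \leq 2 \exp\left(-2 n \epsilon^2\right).
% To estimate a layout's click-through rate to within epsilon = 0.01
% with confidence 1 - delta = 0.95, it suffices to take
n \geq \frac{\log(2/\delta)}{2\epsilon^2}
  = \frac{\log 40}{2 \times 10^{-4}} \approx 18{,}445 \text{ impressions}.
```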
### 10. Empirical Risk Minimization
Risk and loss
- Loss functions
- Risk
- Empirical risk vs True risk
- Empirical Risk minimization
Underfitting and Overfitting
Examples
- Classification
- Regression
Application of McDiarmid’s inequality
### 11. Uniform Convergence and VC Dimension
Infinitely many hypotheses
Uniform convergence bounds
PAC bound for estimation error
Structural risk minimization
Vapnik-Chervonenkis dimension
- Shattering
- Growth Function
- Sauer’s lemma
- Vapnik-Chervonenkis inequality
- Vapnik-Chervonenkis theorem
### 12. Gaussian Processes
Motivation - estimating height & weight
Gaussian Process
- Mean function
- Covariance function
- Relation to kernels
Gaussian Process regression
- Jointly normal distribution
- Adding noise
- Schur complement
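Putting the three items above together: writing the training outputs $y$ (with noise $\sigma^2$) and the test outputs $f_*$ as jointly normal and conditioning via the Schur complement gives the standard GP regression posterior. Here $K$ is the train-train kernel matrix, $K_*$ the train-test block, and $K_{**}$ the test-test block.

```latex
\begin{bmatrix} y \\ f_* \end{bmatrix}
  \sim \mathcal{N}\!\left( 0,
  \begin{bmatrix} K + \sigma^2 I & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right)
\;\Longrightarrow\;
f_* \mid y \sim \mathcal{N}\!\left(
  K_*^\top \left(K + \sigma^2 I\right)^{-1} y,\;
  K_{**} - K_*^\top \left(K + \sigma^2 I\right)^{-1} K_* \right).
```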
Conditional Exponential Models
- Logistic regression
- Connection to SVMs
- Poisson regression
### 13. Exponential Families
Definition
- Properties (expectations, covariance)
- Multinomial, Gauss, Binary, Poisson
- Principal Component Analysis
Inference
- Conjugate Priors
- Norm-based priors
- Maximum likelihood and Maximum-a-posteriori inference
- Examples - Beta, Dirichlet, Gamma, Wishart
### 14. Principal Component Analysis
Definition
- PCA algorithms
Applications
- Eigen Faces
- Image Compression
- Image Denoising
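A minimal sketch of one PCA algorithm, computing the principal components from the SVD of the centered data matrix; the synthetic two-dimensional data below are invented for illustration.

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data: the principal components are the
    top right singular vectors, i.e. eigenvectors of the covariance matrix.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k]          # projected scores, components

# Hypothetical data: 1-d signal embedded along direction (2, 1) plus noise.
rng = np.random.default_rng(0)
X = (rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]])
     + 0.1 * rng.normal(size=(100, 2)))
scores, components = pca(X, 1)
print(np.round(components, 2))  # roughly +/-(0.89, 0.45), i.e. (2,1) normalized
```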
### 15. Directed Graphical Models
Application - sprinkler, earthquake, and burglar
Conditional probabilities
- Joint distribution
- Parameter estimation for fully observed models
Statistical model
- Conditional independence
- Effects of observing variables (explaining away etc.)
- Causal implications
### 16. Dynamic Programming
Application - Route Planning
Dynamic programming
- Problem
- Basic idea
Examples
- Chains
- Trees
- Hypergraphs
- Loops
Generalized Distributive Law
- Semirings
- Examples
### 17. Latent Variable Models
Application - clustering users for a web service
Variational inference
- Problem
- Variational inequality
Mixtures of Gaussians
- Statistical model
- Variational inequality
- Expectation Maximization algorithm
k-means algorithm
- Limit case of mixture of Gaussians
- Simple algorithm
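A sketch of the simple k-means algorithm above, which is the hard-assignment limit of EM on a mixture of Gaussians as the shared spherical covariance shrinks to zero; the toy blobs are invented for illustration.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate hard assignments and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # E-step analogue: hard-assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # M-step analogue: each center becomes the mean of its points
        # (keep the old center if a cluster happens to go empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical 2-d data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
centers, labels = kmeans(X, 2)
print(np.round(centers, 1))  # approximately (0, 0) and (3, 3)
```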
Hidden Markov Models
- Statistical model
- Variational inequality
- Dynamic programming (Viterbi and Baum-Welch)
### 18. Sampling
Motivation
- Intractable distributions
- Application - stock market, parameter inference
Algorithms
- Rejection sampling
- Gibbs sampling
- Metropolis Hastings sampler
- Markov Chain Monte Carlo (just basic idea)
Markov Chains
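A sketch of a random-walk Metropolis-Hastings sampler: only the unnormalized density enters, through the acceptance ratio, which is what makes the method usable for the intractable distributions above. The target, step size, and sample count are illustrative.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized density exp(log_p).

    With a symmetric proposal, the acceptance probability reduces to
    min(1, p(x') / p(x)), so the normalizer of p cancels.
    """
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()      # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal                        # accept; otherwise keep x
        samples.append(x)
    return np.array(samples)

# Target: a standard normal, known only up to its normalizing constant.
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())  # approximately 0 and 1
```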
### 19. Information Theory
Entropy
- Basic properties, decomposition
- Kullback-Leibler Divergence (with properties)
- Mutual Information
Examples
- Exponential families
- Gaussian mutual information as special case
Application - cocktail party problem
- Independent Component Analysis
- JADE, Radical, InfoMax, and FastICA (time permitting)
### 20. Decision Trees
Application - an engineer discovers machine learning
Trees
- Rules (if then else)
- Objective functions & criteria for splitting
- Pruning
Fast implementations
- Acceleration by Chernoff bounds
- Data distribution
### 21. Neural Networks
Application - finding cats on the internet without trying
Nonlinear basis functions
- Optimizing weights
- Optimizing function classes
Multiple layers
- Hidden layers
- Backprop
Architectures
- Autoencoder
- Boltzmann machine
- Convolutional network
### 22. Boosting
Application - combining experts’ advice
Boosting problem
- Weak learners
- Multiplicative updates
- Convergence bounds
Properties
- Noise sensitivity
- Modifications for robustness (e.g. rho-boost)
Applications - Viola-Jones face detection
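The canonical instance of the multiplicative-update scheme above is AdaBoost; below is a compact sketch with depth-1 decision stumps as the weak learners. The toy data and round count are invented for illustration.

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """AdaBoost with decision stumps: each round reweights examples
    multiplicatively, focusing later weak learners on current mistakes."""
    n = len(y)
    w = np.full(n, 1 / n)
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(X.shape[1]):             # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()    # weighted training error
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)          # multiplicative update
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
print(predict(adaboost(X, y), X))  # -> [-1. -1.  1.  1.]
```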
### 23. Kalman Filter
Application - robot navigation
Linear Dynamical systems
- Linear dependence
- Linear observation model
- Forward smoothing
Inference
- Rauch-Tung-Striebel smoother
- Parameter inference
- Gaussian elimination
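One predict-update step of the Kalman filter for the linear dynamical system listed above; the constant-velocity model and noise levels below are invented for illustration (Rauch-Tung-Striebel smoothing would add a backward pass).

```python
import numpy as np

def kalman_step(x, P, z, A, C, Q, R):
    """One forward step for x_t = A x_{t-1} + noise(Q), z_t = C x_t + noise(R)."""
    # Predict: propagate the state estimate through the linear dynamics.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct with the new observation z via the Kalman gain.
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

# Hypothetical 1-d constant-velocity model: state = (position, velocity).
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # linear dependence between states
C = np.array([[1.0, 0.0]])               # we observe position only
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:
    x, P = kalman_step(x, P, np.array([z]), A, C, Q, R)
print(np.round(x, 2))  # position and velocity estimates near 4 and 1
```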
### 24. Reinforcement Learning
Application - operating a robot
Temporal Difference Learning
Q Learning
Value function
Partially Observed Markov Decision Process
Policy methods
- Policy evaluation
- Policy iteration
- Policy gradient
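A tabular sketch of Q-learning with a temporal-difference update, on a hypothetical chain environment; the environment, reward, and hyperparameters are invented for illustration.

```python
import numpy as np

def q_learning(n_states=5, n_actions=2, episodes=2000, alpha=0.1,
               gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0
    moves left; only reaching the rightmost state pays reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            # Epsilon-greedy action selection (explore vs. exploit).
            a = rng.integers(n_actions) if rng.random() < eps else np.argmax(Q[s])
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Temporal-difference update toward the bootstrapped target.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q

Q = q_learning()
print(np.argmax(Q, axis=1))  # learned policy: always move right -> [1 1 1 1 1]
```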
### 25. Scalability
Application - building a startup
Infrastructure
- Computers and Clouds
- Storage
- Fault tolerance
Processing paradigms
- MapReduce
- Streaming
- (key,value) store
Algorithmic templates
- Consistent hashing
- Distributed hash table
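A sketch of consistent hashing, the first algorithmic template above: keys and (virtual) nodes are hashed onto a ring, and a key belongs to the next node clockwise, so adding or removing a node only remaps a small fraction of keys. The node names, virtual-node count, and hash choice are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing ring with virtual nodes to smooth the load.

    A sketch of the idea, not a production distributed hash table.
    """

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Find the first virtual node clockwise from the key's hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.lookup("user:42"))  # one of the three servers, stable under churn
```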
Machine learning
- Distributed optimization
- Distributed sampling
## Final Project Presentations
Each team gets to present a poster. Good teams get to present a spotlight, and the six best team projects get talks. Make sure you discuss what you’re doing, why you’re doing it, in what way it is different from or better than what’s available, and what it is good for.