Syllabus
Introduction to Machine Learning - 10-701/15-781
## Overview
Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you. This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics, and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics, and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics, and algorithms will be at an advantage, but the class has been designed so that anyone with a strong quantitative background can catch up and fully participate. If you are interested in this topic but are not a PhD student, or are a PhD student not specializing in machine learning, you might consider Roni Rosenfeld’s master’s-level course on Machine Learning, 10-601.

## Syllabus

### 1. Introduction to Machine Learning
Applications
- Optical character recognition
- Bioinformatics
- Computational advertising
- Self-driving cars
- Network security
Programming with data
- Classification, Regression, Annotation
- Finding unusual events
Data
- Supervised
- Unsupervised
- Semi-supervised and transductive inference
Basic tools
- Linear classification, regression
- Feature maps
- Trees
- Instance based classifiers
Challenges
- Model selection, underfitting, overfitting
- Validation, confidence
- Explore/exploit tradeoff, reactive environments
### 2. Probability and Parameter Estimation
Probabilities
- Dependence, independence, conditional probabilities
- Bayes rule and Chain rule
- Paradoxes in measure theory
Parameter estimation
- Maximum Likelihood Estimation (MLE)
- Maximum a Posteriori Estimation (MAP)
Application I - Naive Bayes for spam filtering
- Discrete features in Naive Bayes
- Estimating parameters
- Finite sample size problems
Application II - Naive Bayes for fMRI data processing
- Continuous features in Naive Bayes
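To make the estimation ideas above concrete, here is a minimal sketch of a discrete Naive Bayes spam filter in Python. The toy documents, labels, and Laplace smoothing constant are invented for illustration; this is a sketch of the standard algorithm, not the course’s reference code.

```python
import numpy as np

# Toy data: rows are documents, columns are binary word-occurrence features.
# X, y, and the implied vocabulary are made up for illustration.
X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

def fit_naive_bayes(X, y, alpha=1.0):
    """MAP estimates with Laplace smoothing (a Beta prior on each feature).

    alpha=0 recovers the maximum likelihood estimate, which can assign
    zero probability to unseen words (the finite sample size problem above).
    """
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    # P(word_j = 1 | class c), smoothed so no probability is exactly 0 or 1.
    likelihoods = np.array([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in classes
    ])
    return classes, priors, likelihoods

def predict(X, classes, priors, likelihoods):
    # Work in log space to avoid underflow on long documents.
    log_post = (np.log(priors) + X @ np.log(likelihoods).T
                + (1 - X) @ np.log(1 - likelihoods).T)
    return classes[np.argmax(log_post, axis=1)]

classes, priors, likelihoods = fit_naive_bayes(X, y)
print(predict(np.array([[1, 1, 0]]), classes, priors, likelihoods))  # -> [1]
```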
### 3. Instance Based Learning
Parzen windows
- Basic idea (smoothing over empirical average)
- Kernels
Model selection
- Overfitting and underfitting
- Cross-validation and leave-one-out estimation
- Bias-variance tradeoff
- Curse of dimensionality
Nadaraya-Watson estimators
- Regression
- Classification
Nearest neighbor estimator
- Limit case via Parzen
- Fast lookup
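A minimal sketch of the Nadaraya-Watson estimator for regression, smoothing an empirical average with a Gaussian Parzen window. The data and bandwidth below are invented; in practice the bandwidth would be chosen by cross-validation as listed above.

```python
import numpy as np

def nadaraya_watson(x_query, X, y, bandwidth=0.5):
    """Nadaraya-Watson kernel regression with a Gaussian (Parzen) kernel.

    The prediction is a locally weighted average of the training targets;
    the bandwidth is the smoothing parameter that model selection tunes.
    """
    weights = np.exp(-0.5 * ((x_query[:, None] - X[None, :]) / bandwidth) ** 2)
    return (weights * y).sum(axis=1) / weights.sum(axis=1)

# Hypothetical 1-d data: a noisy sine curve.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 50))
y = np.sin(X) + 0.1 * rng.normal(size=50)
print(nadaraya_watson(np.array([np.pi / 2]), X, y))  # close to sin(pi/2) = 1
```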
### 4. Perceptron
Application - Hebbian learning
Perceptron
- Algorithm
- Convergence proof
- Properties
Kernel trick
- Basic idea
- Kernel Perceptron
- Kernel expansion
Kernel examples
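A sketch of the primal perceptron algorithm on an invented separable toy set (the kernel variant above replaces the inner products with kernel evaluations). By the convergence proof, the loop terminates after finitely many mistakes when the data are linearly separable.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """Rosenblatt's perceptron: additively update (w, b) on every mistake.

    On linearly separable data the number of mistakes is bounded by
    (R / margin)^2, so the early-exit below is eventually taken.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:   # mistake: wrong side of hyperplane
                w += y_i * x_i             # move w toward the missed example
                b += y_i
                mistakes += 1
        if mistakes == 0:                  # data separated: stop early
            break
    return w, b

# Hypothetical separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # -> [ 1.  1. -1. -1.]
```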
### 5. Support Vector Classification
Application - Optical Character Recognition
Support Vector Machines
- Large Margin Classification
- Optimization Problem
- Dual Problem
Properties
- Support Vectors
- Support Vector expansion
Soft Margin Classifier
- Noise tolerance
- Optimization problem
- Dual problem
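For reference, the soft-margin optimization problem and its dual in the usual form; the dual variables $\alpha_i$ are nonzero exactly on the support vectors, which gives the support vector expansion above.

```latex
% Soft-margin SVM primal:
\min_{w, b, \xi} \;\; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \;\; y_i \left(\langle w, x_i \rangle + b\right) \geq 1 - \xi_i,
\;\; \xi_i \geq 0
% and its dual:
\max_{\alpha} \;\; \sum_{i} \alpha_i
  - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle
\quad \text{s.t.} \;\; 0 \leq \alpha_i \leq C, \;\; \sum_i \alpha_i y_i = 0.
```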
### 6. Kernels
Kernel properties
- Hilbert spaces
- Convex cone, distances, norms
- Mercer’s theorem
- Representer theorem
- Kernel expansion
- Kernel trick in general
Regularization and features
- Shotgun approach
- Regularization
- Explicit feature map
Hash kernels
- Application - spam filtering
- Hashing trick
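A sketch of the hashing trick from the items above: each token is mapped by a hash function to one of a fixed number of buckets, so no dictionary is needed and memory stays bounded. The hash function, signing scheme, and bucket count below are illustrative choices, not a specification.

```python
import hashlib

def hashed_features(tokens, num_buckets=2**10):
    """Hashing trick: build a fixed-size feature vector without a dictionary.

    The random sign keeps inner products unbiased in expectation despite
    hash collisions.
    """
    x = [0.0] * num_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        idx = h % num_buckets                       # which bucket
        sign = 1.0 if (h // num_buckets) % 2 == 0 else -1.0
        x[idx] += sign
    return x

x = hashed_features("cheap pills buy now buy".split())
print(sum(v != 0 for v in x))  # number of occupied buckets
```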
### 7. Kernels
Application - jet engine failure detection
Support Vector Novelty Detection
- Density estimation
- Thresholding and scaling
- Optimization problem and dual
- Online setting
Support Vector Regression
- Loss functions
- Optimization problem and dual
- Examples
Kernel PCA
- Principal Component Analysis
- Kernelization
### 8. Convergence
Convergence properties
- Weak convergence
- Convergence in probability
- Strong (almost surely)
Guarantees
- Law of large numbers
### 9. Tail bounds
- Central limit theorem
- Tail bounds (Markov, Chebyshev, Hoeffding, Bernstein, and McDiarmid inequalities)
- Application - A/B testing for page layout
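As a worked instance for the A/B-testing application above, Hoeffding’s inequality controls how far an empirical click-through rate can stray from its mean, assuming i.i.d. clicks bounded in $[0, 1]$; the target accuracy and confidence below are illustrative choices.

```latex
% Hoeffding's inequality for i.i.d. X_i in [0, 1] with mean mu:
P\left(\left|\tfrac{1}{n}\textstyle\sum_{i=1}^{n} X_i - \mu\right| \geq \epsilon\right)
  \leq 2 \exp\left(-2 n \epsilon^2\right).
% To estimate a layout's click-through rate to within epsilon = 0.01
% with confidence 1 - delta = 0.95, it suffices to take
n \geq \frac{\log(2/\delta)}{2\epsilon^2}
  = \frac{\log 40}{2 \times 10^{-4}} \approx 18{,}445 \text{ impressions}.
```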
### 10. Empirical Risk Minimization
Risk and loss
- Loss functions
- Risk
- Empirical risk vs True risk
- Empirical Risk minimization
Underfitting and Overfitting
Examples
- Classification
- Regression
Application of McDiarmid’s inequality
### 11. Uniform Convergence and VC Dimension
Infinitely many hypotheses
Uniform convergence bounds
PAC bound for estimation error
Structural risk minimization
Vapnik-Chervonenkis dimension
- Shattering
- Growth Function
- Sauer’s lemma
- Vapnik-Chervonenkis inequality
- Vapnik-Chervonenkis theorem
### 12. Gaussian Processes
Motivation - estimating height & weight
Gaussian Process
- Mean function
- Covariance function
- Relation to kernels
Gaussian Process regression
- Jointly normal distribution
- Adding noise
- Schur complement
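Putting the three items above together: writing the training outputs $y$ (with noise $\sigma^2$) and the test outputs $f_*$ as jointly normal and conditioning via the Schur complement gives the standard GP regression posterior. Here $K$ is the train-train kernel matrix, $K_*$ the train-test block, and $K_{**}$ the test-test block.

```latex
\begin{bmatrix} y \\ f_* \end{bmatrix}
  \sim \mathcal{N}\!\left( 0,
  \begin{bmatrix} K + \sigma^2 I & K_* \\ K_*^\top & K_{**} \end{bmatrix} \right)
\;\Longrightarrow\;
f_* \mid y \sim \mathcal{N}\!\left(
  K_*^\top \left(K + \sigma^2 I\right)^{-1} y,\;
  K_{**} - K_*^\top \left(K + \sigma^2 I\right)^{-1} K_* \right).
```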
Conditional Exponential Models
- Logistic regression
- Connection to SVMs
- Poisson regression
### 13. Exponential Families
Definition
- Properties (expectations, covariance)
- Multinomial, Gauss, Binary, Poisson
- Principal Component Analysis
Inference
- Conjugate Priors
- Norm-based priors
- Maximum likelihood and Maximum-a-posteriori inference
- Examples - Beta, Dirichlet, Gamma, Wishart
### 14. Principal Component Analysis
Definition
- PCA algorithms
Applications
- Eigen Faces
- Image Compression
- Image Denoising
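A minimal sketch of one PCA algorithm, computing the principal components from the SVD of the centered data matrix; the synthetic two-dimensional data below are invented for illustration.

```python
import numpy as np

def pca(X, k):
    """PCA via SVD of the centered data: the principal components are the
    top right singular vectors, i.e. eigenvectors of the covariance matrix.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k]          # projected scores, components

# Hypothetical data: 1-d signal embedded along direction (2, 1) plus noise.
rng = np.random.default_rng(0)
X = (rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]])
     + 0.1 * rng.normal(size=(100, 2)))
scores, components = pca(X, 1)
print(np.round(components, 2))  # roughly +/-(0.89, 0.45), i.e. (2,1) normalized
```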
### 15. Directed Graphical Models
Application - sprinkler, earthquake, and burglar
Conditional probabilities
- Joint distribution
- Parameter estimation for fully observed models
Statistical model
- Conditional independence
- Effects of observing variables (explaining away etc.)
- Causal implications
### 16. Dynamic Programming
Application - Route Planning
Dynamic programming
- Problem
- Basic idea
Examples
- Chains
- Trees
- Hypergraphs
- Loops
Generalized Distributive Law
- Semirings
- Examples
### 17. Latent Variable Models
Application - clustering users for a web service
Variational inference
- Problem
- Variational inequality
Mixtures of Gaussians
- Statistical model
- Variational inequality
- Expectation Maximization algorithm
k-means algorithm
- Limit case of mixture of Gaussians
- Simple algorithm
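A sketch of the simple k-means algorithm above, which is the hard-assignment limit of EM on a mixture of Gaussians as the shared spherical covariance shrinks to zero; the toy blobs are invented for illustration.

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: alternate hard assignments and center updates."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # E-step analogue: hard-assign each point to its nearest center.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        # M-step analogue: each center becomes the mean of its points
        # (keep the old center if a cluster happens to go empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Hypothetical 2-d data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])
centers, labels = kmeans(X, 2)
print(np.round(centers, 1))  # approximately (0, 0) and (3, 3)
```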
Hidden Markov Models
- Statistical model
- Variational inequality
- Dynamic programming (Viterbi and Baum-Welch)
### 18. Sampling
Motivation
- Intractable distributions
- Application - stock market, parameter inference
Algorithms
- Rejection sampling
- Gibbs sampling
- Metropolis Hastings sampler
- Markov Chain Monte Carlo (just basic idea)
Markov Chains
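A sketch of a random-walk Metropolis-Hastings sampler: only the unnormalized density enters, through the acceptance ratio, which is what makes the method usable for the intractable distributions above. The target, step size, and sample count are illustrative.

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings for an unnormalized density exp(log_p).

    With a symmetric proposal, the acceptance probability reduces to
    min(1, p(x') / p(x)), so the normalizer of p cancels.
    """
    rng = np.random.default_rng(seed)
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.normal()      # symmetric Gaussian proposal
        if np.log(rng.uniform()) < log_p(proposal) - log_p(x):
            x = proposal                        # accept; otherwise keep x
        samples.append(x)
    return np.array(samples)

# Target: a standard normal, known only up to its normalizing constant.
samples = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0)
print(samples.mean(), samples.std())  # approximately 0 and 1
```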
### 19. Information Theory
Entropy
- Basic properties, decomposition
- Kullback-Leibler Divergence (with properties)
- Mutual Information
Examples
- Exponential families
- Gaussian mutual information as special case
Application - cocktail party problem
- Independent Component Analysis
- JADE, Radical, InfoMax, and FastICA (time permitting)
### 20. Decision Trees
Application - an engineer discovers machine learning
Trees
- Rules (if then else)
- Objective functions & criteria for splitting
- Pruning
Fast implementations
- Acceleration by Chernoff bounds
- Data distribution
### 21. Neural Networks
Application - finding cats on the internet without trying
Nonlinear basis functions
- Optimizing weights
- Optimizing function classes
Multiple layers
- Hidden layers
- Backprop
Architectures
- Autoencoder
- Boltzmann machine
- Convolutional network
### 22. Boosting
Application - combining experts’ advice
Boosting problem
- Weak learners
- Multiplicative updates
- Convergence bounds
Properties
- Noise sensitivity
- Modifications for robustness (e.g. rho-boost)
Applications - Viola-Jones face detection
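The canonical instance of the multiplicative-update scheme above is AdaBoost; below is a compact sketch with depth-1 decision stumps as the weak learners. The toy data and round count are invented for illustration.

```python
import numpy as np

def adaboost(X, y, rounds=10):
    """AdaBoost with decision stumps: each round reweights examples
    multiplicatively, focusing later weak learners on current mistakes."""
    n = len(y)
    w = np.full(n, 1 / n)
    ensemble = []
    for _ in range(rounds):
        best = None
        for j in range(X.shape[1]):             # exhaustive stump search
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = w[pred != y].sum()    # weighted training error
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign)
        err, j, thr, sign = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = sign * np.where(X[:, j] > thr, 1, -1)
        w *= np.exp(-alpha * y * pred)          # multiplicative update
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.where(X[:, j] > t, 1, -1) for a, j, t, s in ensemble)
    return np.sign(score)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
print(predict(adaboost(X, y), X))  # -> [-1. -1.  1.  1.]
```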
### 23. Kalman Filter
Application - robot navigation
Linear Dynamical systems
- Linear dependence
- Linear observation model
- Forward smoothing
Inference
- Rauch-Tung-Striebel smoother
- Parameter inference
- Gaussian elimination
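One predict-update step of the Kalman filter for the linear dynamical system listed above; the constant-velocity model and noise levels below are invented for illustration (Rauch-Tung-Striebel smoothing would add a backward pass).

```python
import numpy as np

def kalman_step(x, P, z, A, C, Q, R):
    """One forward step for x_t = A x_{t-1} + noise(Q), z_t = C x_t + noise(R)."""
    # Predict: propagate the state estimate through the linear dynamics.
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Update: correct with the new observation z via the Kalman gain.
    S = C @ P_pred @ C.T + R
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - C @ x_pred)
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new

# Hypothetical 1-d constant-velocity model: state = (position, velocity).
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # linear dependence between states
C = np.array([[1.0, 0.0]])               # we observe position only
Q, R = 0.01 * np.eye(2), np.array([[0.5]])
x, P = np.zeros(2), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:
    x, P = kalman_step(x, P, np.array([z]), A, C, Q, R)
print(np.round(x, 2))  # position and velocity estimates near 4 and 1
```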
### 24. Reinforcement Learning
Application - operating a robot
Temporal Difference Learning
Q Learning
Value function
Partially Observed Markov Decision Process
Policy methods
- Policy evaluation
- Policy iteration
- Policy gradient
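A tabular sketch of Q-learning with a temporal-difference update, on a hypothetical chain environment; the environment, reward, and hyperparameters are invented for illustration.

```python
import numpy as np

def q_learning(n_states=5, n_actions=2, episodes=2000, alpha=0.1,
               gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a toy chain: action 1 moves right, action 0
    moves left; only reaching the rightmost state pays reward."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            # Epsilon-greedy action selection (explore vs. exploit).
            a = rng.integers(n_actions) if rng.random() < eps else np.argmax(Q[s])
            s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # Temporal-difference update toward the bootstrapped target.
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q

Q = q_learning()
print(np.argmax(Q, axis=1))  # learned policy: always move right -> [1 1 1 1 1]
```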
### 25. Scalability
Application - building a startup
Infrastructure
- Computers and Clouds
- Storage
- Fault tolerance
Processing paradigms
- MapReduce
- Streaming
- (key,value) store
Algorithmic templates
- Consistent hashing
- Distributed hash table
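A sketch of consistent hashing, the first algorithmic template above: keys and (virtual) nodes are hashed onto a ring, and a key belongs to the next node clockwise, so adding or removing a node only remaps a small fraction of keys. The node names, virtual-node count, and hash choice are illustrative.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing ring with virtual nodes to smooth the load.

    A sketch of the idea, not a production distributed hash table.
    """

    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        # Find the first virtual node clockwise from the key's hash.
        idx = bisect.bisect(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.lookup("user:42"))  # one of the three servers, stable under churn
```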
Machine learning
- Distributed optimization
- Distributed sampling
## Final Project Presentations
Each team gets to present a poster. Good teams get to present a spotlight, and the six best team projects get talks. Make sure you discuss what you’re doing, why you’re doing it, in what way it is different from or better than what’s available, and what it is good for.