Introduction to Machine Learning - 10-701/15-781
Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you.
This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics, and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics, and information theory. Students entering the class with a working knowledge of probability, statistics, and algorithms will be at an advantage, but the class has been designed so that anyone with a strong quantitative background can catch up and fully participate. If you are interested in this topic, but are not a PhD student, or are a PhD student not specializing in machine learning, you might consider Roni Rosenfeld's master's level course on Machine Learning, 10-601.
Applications
Optical character recognition
Bioinformatics
Computational advertising
Self-driving cars
Network security
Programming with data
Classification, Regression, Annotation
Finding unusual events
Data
Supervised
Unsupervised
Semi-supervised and transductive inference
Basic tools
Linear classification, regression
Feature maps
Trees
Instance based classifiers
Challenges
Model selection, underfitting, overfitting
Validation, confidence
Explore-exploit tradeoffs in reactive environments
Probabilities
Dependence, independence, conditional probabilities
Bayes rule and Chain rule
Paradoxes in measure theory
Parameter estimation
Maximum Likelihood Estimation (MLE)
Maximum a Posteriori Estimation (MAP)
Application I - Naive Bayes for spam filtering
Discrete features in Naive Bayes
Estimating parameters
Finite sample size problems
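To make the spam-filtering block above concrete, here is a minimal sketch (not from the course materials) of a bag-of-words Naive Bayes classifier with Laplace (add-one) smoothing, which is one standard fix for the finite-sample-size problem of unseen words; the toy corpus and variable names are illustrative.

import math
from collections import Counter

# Toy corpus: (document tokens, label); purely illustrative.
train = [
    ("cheap pills buy now".split(), "spam"),
    ("meeting agenda attached".split(), "ham"),
    ("buy cheap watches".split(), "spam"),
    ("lunch meeting tomorrow".split(), "ham"),
]

labels = {"spam", "ham"}
vocab = {w for doc, _ in train for w in doc}
word_counts = {c: Counter() for c in labels}          # word counts per class
class_counts = Counter(c for _, c in train)           # documents per class
for doc, c in train:
    word_counts[c].update(doc)

def log_posterior(doc, c, alpha=1.0):
    """log P(c) + sum_w log P(w | c), with add-alpha (Laplace) smoothing."""
    lp = math.log(class_counts[c] / len(train))
    total = sum(word_counts[c].values())
    for w in doc:
        lp += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
    return lp

def classify(doc):
    return max(labels, key=lambda c: log_posterior(doc, c))

print(classify("buy pills".split()))         # -> spam on this toy data
print(classify("agenda for lunch".split()))  # -> ham on this toy data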
Application II - Naive Bayes for fMRI data processing
Continuous features in Naive Bayes
Parzen windows
Basic idea (smoothing over empirical average)
Kernels
Model selection
Overfitting and underfitting
Cross-validation and leave-one-out estimation
Bias-variance tradeoff
Curse of dimensionality
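A minimal sketch of k-fold cross-validation used for model selection, here to pick the bandwidth of a Gaussian Parzen-window density estimator (small bandwidths overfit, large ones underfit); the estimator, toy data, and grid of bandwidths are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)          # toy 1-d sample

def parzen_loglik(train, test, h):
    """Mean log-likelihood of `test` under a Gaussian Parzen window fit on `train`."""
    d = (test[:, None] - train[None, :]) / h
    dens = np.exp(-0.5 * d**2).sum(axis=1) / (len(train) * h * np.sqrt(2 * np.pi))
    return np.log(dens + 1e-300).mean()

def cross_validate(x, h, k=5):
    folds = np.array_split(rng.permutation(x), k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        scores.append(parzen_loglik(train, test, h))
    return np.mean(scores)

# Too small h overfits (spiky density), too large h underfits (over-smoothed).
for h in [0.01, 0.1, 0.5, 1.0, 3.0]:
    print(f"h={h}: CV log-likelihood = {cross_validate(x, h):.3f}")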
Nadaraya-Watson estimators
Regression
Classification
Nearest neighbor estimator
Limit case via Parzen
Fast lookup
Application - Hebbian learning
Perceptron
Algorithm
Convergence proof
Properties
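A minimal sketch of the perceptron algorithm from the block above, run on linearly separable toy data; the data, the separating hyperplane, and the epoch limit are illustrative.

import numpy as np

rng = np.random.default_rng(0)
# Toy linearly separable data: labels are the sign of a fixed hyperplane.
X = rng.normal(size=(100, 2))
y = np.sign(X @ np.array([2.0, -1.0]) + 0.5)

def perceptron(X, y, max_epochs=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:      # mistake: update towards the example
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                   # converged: the data is separated
            break
    return w, b

w, b = perceptron(X, y)
print("training error:", np.mean(np.sign(X @ w + b) != y))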
Kernel trick
Basic idea
Kernel Perceptron
Kernel expansion
Kernel examples
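A minimal sketch of the kernelized version: the perceptron rewritten in terms of a kernel expansion over the training points, with an RBF kernel as one of the kernel examples. The toy data (not linearly separable in the original space) and the bandwidth are illustrative.

import numpy as np

rng = np.random.default_rng(1)
# Toy data that is not linearly separable: label by distance from the origin.
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1.0, -1.0)

def rbf_kernel(A, B, gamma=1.0):
    """k(x, x') = exp(-gamma ||x - x'||^2), computed for all pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_perceptron(X, y, epochs=20, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    alpha = np.zeros(len(X))                 # one coefficient per training point
    for _ in range(epochs):
        for i in range(len(X)):
            # f(x_i) = sum_j alpha_j y_j k(x_j, x_i)  -- the kernel expansion
            if y[i] * np.sum(alpha * y * K[:, i]) <= 0:
                alpha[i] += 1.0
    return alpha

alpha = kernel_perceptron(X, y)
pred = np.sign(rbf_kernel(X, X) @ (alpha * y))
print("training error:", np.mean(pred != y))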
Application - Optical Character Recognition
Support Vector Machines
Large Margin Classification
Optimization Problem
Dual Problem
Properties
Support Vectors
Support Vector expansion
Soft Margin Classifier
Noise tolerance
Optimization problem
Dual problem
Kernel properties
Hilbert spaces
Convex cone, distances, norms
Mercer's theorem
Representer theorem
Kernel expansion
Kernel trick in general
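As a complement to the dual treatment above, here is a minimal sketch of solving the soft-margin SVM in its primal form (regularized hinge loss) by stochastic subgradient descent, in the style of Pegasos; this is not the dual QP covered in the lectures, and the toy data and regularization constant are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.sign(X @ np.array([1.5, -2.0]) + 0.3)     # toy separable labels

lam = 0.01                                       # regularization constant
w, b = np.zeros(2), 0.0
for t in range(1, 20001):
    i = rng.integers(len(X))
    eta = 1.0 / (lam * t)                        # decaying step size
    margin = y[i] * (X[i] @ w + b)
    # Subgradient of  lam/2 ||w||^2 + max(0, 1 - y (w.x + b))
    if margin < 1:
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
        b += eta * y[i]
    else:
        w = (1 - eta * lam) * w

print("training error:", np.mean(np.sign(X @ w + b) != y))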
Regularization and features
Shotgun approach
Regularization
Explicit feature map
Hash kernels
Application - spam filtering
Hashing trick
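A minimal sketch of the hashing trick: tokens are mapped straight into a fixed-size feature vector instead of building an explicit vocabulary, with a second hash providing a +/-1 sign so collisions cancel in expectation. The hash function, dimension, and documents are illustrative choices.

import numpy as np

DIM = 2 ** 10   # fixed feature dimension, independent of the vocabulary size

def hashed_features(tokens, dim=DIM):
    """Map a bag of words into a dim-dimensional vector via hashing.
    Note: Python's built-in hash is salted per process, so features are only
    consistent within one run; a fixed hash (e.g. hashlib) would be reproducible."""
    x = np.zeros(dim)
    for tok in tokens:
        idx = hash(tok) % dim
        sign = 1.0 if hash("sign:" + tok) % 2 == 0 else -1.0
        x[idx] += sign
    return x

# Two documents; their inner product approximates the bag-of-words inner product.
a = hashed_features("buy cheap pills now".split())
b = hashed_features("cheap pills for sale".split())
print("hashed inner product:", a @ b)   # ~2 (shared words: cheap, pills)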
Application - jet engine failure detection
Support Vector Novelty Detection
Density estimation
Thresholding and scaling
Optimization problem and dual
Online setting
Support Vector Regression
Loss functions
Optimization problem and dual
Examples
Kernel PCA
Principal Component Analysis
Kernelization
Convergence properties
Weak convergence
Convergence in probability
Strong convergence (almost sure)
Guarantees
Law of large numbers
Central limit theorem
Tail bounds (Markov, Chebyshev, Hoeffding, Bernstein, and McDiarmid inequalities)
Application - A/B testing for page layout
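A small worked example of using Hoeffding's inequality for the A/B-testing application: how many page views are needed so that the empirical click-through rate is within epsilon of the true rate with high probability. The tolerance and confidence level below are illustrative.

import math

def hoeffding_sample_size(eps, delta):
    """Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e.
    P(|p_hat - p| >= eps) <= delta for the mean of n i.i.d. observations in [0, 1]
    (e.g. click / no click)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# To estimate a click-through rate to within 1% with 95% confidence:
print(hoeffding_sample_size(eps=0.01, delta=0.05))   # 18445 impressions per layout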
Risk and loss
Loss functions
Risk
Empirical risk vs True risk
Empirical Risk minimization
Underfitting and Overfitting
Examples
Classification
Regression
Application of McDiarmid's inequality
Infinitely many hypotheses
Uniform convergence bounds
PAC bound for estimation error
Structural risk minimization
Vapnik Chervonenkis dimension
Shattering
Growth Function
Sauer's lemma
Vapnik-Chervonenkis inequality
Vapnik-Chervonenkis theorem
Motivation - estimating height & weight
Gaussian Process
Mean function
Covariance function
Relation to kernels
Gaussian Process regression
Jointly normal distribution
Adding noise
Schur complement
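A minimal sketch of Gaussian process regression with an RBF covariance function and additive observation noise: the posterior mean and covariance follow from conditioning a jointly normal vector (the Schur complement formulas). The kernel, length scale, noise level, and toy data are illustrative.

import numpy as np

def k(A, B, length=0.5):
    """RBF covariance function k(x, x') = exp(-||x - x'||^2 / (2 * length^2))."""
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / length**2)

rng = np.random.default_rng(0)
X = np.linspace(0, 5, 20)                      # training inputs
y = np.sin(X) + 0.1 * rng.normal(size=X.size)  # noisy observations
Xs = np.linspace(0, 5, 100)                    # test inputs
noise = 0.1 ** 2

# Condition the joint Gaussian over (f(X), f(Xs)) on the observed y.
K = k(X, X) + noise * np.eye(X.size)
Ks = k(X, Xs)
mean = Ks.T @ np.linalg.solve(K, y)                 # posterior mean
cov = k(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)     # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))
print("max posterior standard deviation:", std.max())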
Conditional Exponential Models
Logistic regression
Connection to SVMs
Poisson regression
Definition
Properties (expectations, covariance)
Multinomial, Gaussian, Binary, Poisson
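A minimal sketch of logistic regression, the binary conditional exponential model above, fit by gradient ascent on the log-likelihood; the synthetic data, learning rate, and iteration count are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
true_w, true_b = np.array([2.0, -1.0]), 0.5
p_true = 1.0 / (1.0 + np.exp(-(X @ true_w + true_b)))
y = (rng.uniform(size=300) < p_true).astype(float)   # labels drawn from the model

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(2000):
    p = sigmoid(X @ w + b)
    # Gradient of the log-likelihood sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]
    w += lr * X.T @ (y - p) / len(X)
    b += lr * np.mean(y - p)

# Should roughly recover the generating parameters, up to sampling noise.
print("learned weights:", np.round(w, 2), "bias:", round(b, 2))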
Principal Component Analysis
Inference
Conjugate Priors
Norm-based priors
Maximum likelihood and Maximum-a-posteriori inference
Examples - Beta, Dirichlet, Gamma, Wishart
Definition
PCA algorithms
Applications
Eigenfaces
Image Compression
Image Denoising
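A minimal sketch of PCA via the SVD of the centered data matrix, with reconstruction from the top components as used in compression and eigenfaces; the dimensions and synthetic data are illustrative.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: 500 points in 10-d that actually live near a 2-d subspace.
Z = rng.normal(size=(500, 2))
A = rng.normal(size=(2, 10))
X = Z @ A + 0.05 * rng.normal(size=(500, 10))

def pca(X, k):
    mu = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mu, full_matrices=False)
    components = Vt[:k]                      # top-k principal directions
    scores = (X - mu) @ components.T         # low-dimensional representation
    return mu, components, scores

mu, comps, scores = pca(X, k=2)
X_hat = scores @ comps + mu                  # reconstruction from 2 components
print("mean reconstruction error:", np.mean((X - X_hat) ** 2))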
Application: Sprinkler, earthquake and burglar
Conditional probabilities
Joint distribution
Parameter estimation for fully observed models
Statistical model
Conditional independence
Effects of observing variables (explaining away etc.)
Causal implications
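A small worked example for the burglar/earthquake/alarm network: conditional probabilities computed by enumerating the joint distribution, showing the explaining-away effect when a second cause is observed. All conditional probability table numbers here are illustrative assumptions, not taken from the lecture notes.

from itertools import product

P_b = 0.01                                    # P(burglary)
P_e = 0.02                                    # P(earthquake)
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(alarm | burglary, earthquake)

def joint(b, e, a):
    pb = P_b if b else 1 - P_b
    pe = P_e if e else 1 - P_e
    pa = P_a[(b, e)] if a else 1 - P_a[(b, e)]
    return pb * pe * pa

def conditional(query, evidence):
    """P(query | evidence) by summing the joint over all assignments."""
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        w = joint(b, e, a)
        world = {"b": b, "e": e, "a": a}
        if all(world[k] == v for k, v in evidence.items()):
            den += w
            if all(world[k] == v for k, v in query.items()):
                num += w
    return num / den

print("P(burglary | alarm)             =", round(conditional({"b": True}, {"a": True}), 3))
print("P(burglary | alarm, earthquake) =", round(conditional({"b": True}, {"a": True, "e": True}), 3))
# The second probability is much lower: the earthquake "explains away" the alarm.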
Application - Route Planning
Dynamic programming
Problem
Basic idea
Examples
Chains
Trees
Hypergraphs
Loops
Generalized Distributive Law
Semirings
Examples
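A minimal sketch of dynamic programming on a chain using the (min, +) semiring: the shortest route through a sequence of stages. Swapping (min, +) for (sum, *) turns the same recursion into sum-product message passing, which is the point of the generalized distributive law. The stage costs below are illustrative.

import numpy as np

# costs[t][i, j] is the cost of going from option i at stage t to option j at stage t+1.
costs = [np.array([[1, 4], [2, 1]]),
         np.array([[3, 1], [1, 5]]),
         np.array([[2, 2], [4, 1]])]

def shortest_path(costs):
    """Forward pass in the (min, +) semiring; starting cost at stage 0 is zero."""
    d = np.zeros(costs[0].shape[0])          # best cost to reach each option
    back = []
    for C in costs:
        scores = d[:, None] + C              # semiring "product" (here: addition)
        back.append(scores.argmin(axis=0))
        d = scores.min(axis=0)               # semiring "sum" (here: minimum)
    path = [int(d.argmin())]                 # backtrack the optimal route
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return d.min(), path[::-1]

print(shortest_path(costs))                  # total cost 3 on this toy instance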
Application - clustering users for a web service
Variational inference
Problem
Variational inequality
Mixtures of Gaussians
Statistical model
Variational inequality
Expectation Maximization algorithm
k-means algorithm
Limit case of mixture of Gaussians
Simple algorithm
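A minimal sketch of the k-means algorithm, the hard-assignment limit of EM for a mixture of Gaussians; the random initialization and the toy blobs are illustrative, and k-means can of course stop in a local optimum.

import numpy as np

rng = np.random.default_rng(0)
# Toy data: three well-separated Gaussian blobs.
X = np.vstack([rng.normal(c, 0.3, size=(100, 2)) for c in ([0, 0], [3, 3], [0, 3])])

def kmeans(X, k, iters=50):
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        # "E-step": assign each point to its nearest center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(axis=1)
        # "M-step": move each center to the mean of its points.
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, assign

centers, assign = kmeans(X, k=3)
print(np.round(centers, 2))   # should lie near the three blob means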
Hidden Markov Models
Statistical model
Variational inequality
Dynamic programming (Viterbi and Baum-Welch)
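A minimal sketch of the Viterbi dynamic program for an HMM: the most likely hidden state sequence given the observations. The initial, transition, and emission tables are illustrative toy numbers.

import numpy as np

pi = np.array([0.6, 0.4])                      # initial state distribution
A  = np.array([[0.7, 0.3],                     # A[i, j] = P(next = j | current = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],                # B[i, o] = P(obs = o | state = i)
               [0.1, 0.3, 0.6]])
obs = [0, 1, 2, 2, 1]

def viterbi(pi, A, B, obs):
    T, S = len(obs), len(pi)
    logd = np.log(pi) + np.log(B[:, obs[0]])   # log of best path probability so far
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = logd[:, None] + np.log(A)     # scores[i, j]: end in state j via i
        back[t] = scores.argmax(axis=0)
        logd = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(logd.argmax())]
    for t in range(T - 1, 0, -1):              # follow the back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]

print(viterbi(pi, A, B, obs))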
Motivation
Intractable distributions
Application - stock market, parameter inference
Algorithms
Rejection sampling
Gibbs sampling
Metropolis Hastings sampler
Markov Chain Monte Carlo (just basic idea)
Markov Chains
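A minimal sketch of a random-walk Metropolis-Hastings sampler for an unnormalized one-dimensional density; the target distribution, proposal width, chain length, and burn-in are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    """Unnormalized log-density: a two-component Gaussian mixture."""
    return np.log(0.3 * np.exp(-0.5 * (x + 2) ** 2) +
                  0.7 * np.exp(-0.5 * (x - 2) ** 2))

def metropolis_hastings(log_p, steps=20000, step_size=1.0):
    x, chain = 0.0, []
    for _ in range(steps):
        prop = x + step_size * rng.normal()      # symmetric random-walk proposal
        # Accept with probability min(1, p(prop)/p(x)); the normalizer cancels.
        if np.log(rng.uniform()) < log_p(prop) - log_p(x):
            x = prop
        chain.append(x)
    return np.array(chain)

chain = metropolis_hastings(log_target)
print("sample mean after burn-in (true mean is 0.8):", chain[5000:].mean())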
Entropy
Basic properties, decomposition
Kullback-Leibler Divergence (with properties)
Mutual Information
Examples
Exponential families
Gaussian mutual information as special case
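A small worked example of entropy, KL divergence, and mutual information for discrete distributions, using the identity I(X; Y) = KL(P(X,Y) || P(X)P(Y)) = H(X) + H(Y) - H(X,Y); the joint table is an illustrative toy example.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

# Toy joint distribution P(X, Y) over two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

mi = kl(pxy.ravel(), np.outer(px, py).ravel())
print("H(X)   =", entropy(px), "bits")
print("I(X;Y) =", mi, "bits")
print("check  =", entropy(px) + entropy(py) - entropy(pxy.ravel()))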
Application - cocktail party problem
Independent Component Analysis
JADE, Radical, InfoMax, and FastICA (time permitting)
Application - an engineer discovers machine learning
Trees
Rules (if then else)
Objective functions & criteria for splitting
Pruning
Fast implementations
Acceleration by Chernoff bounds
Data distribution
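A minimal sketch of the splitting criterion behind decision trees: pick the threshold on a feature that maximizes information gain (entropy reduction); the single feature and toy labels are illustrative.

import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(x, y):
    """Best threshold on a 1-d feature by information gain."""
    best = (None, -1.0)
    for t in np.unique(x)[:-1]:                   # candidate thresholds
        left, right = y[x <= t], y[x > t]
        gain = entropy(y) - (len(left) * entropy(left) +
                             len(right) * entropy(right)) / len(y)
        if gain > best[1]:
            best = (t, gain)
    return best

# Toy feature: values up to 2.0 are class 0, most larger values are class 1.
x = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5])
y = np.array([0,   0,   0,   1,   1,   1,   1,   0  ])
print(best_split(x, y))   # threshold 2.0 together with its information gain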
Application - finding cats on the internet without trying
Nonlinear basis functions
Optimizing weights
Optimizing function classes
Multiple layers
Hidden layers
Backprop
Architectures
Autoencoder
Boltzmann machine
Convolutional network
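A minimal sketch of a one-hidden-layer network trained by backpropagation on XOR, a problem no linear model can fit; the architecture, nonlinearities, learning rate, and initialization are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)         # XOR targets

# One hidden layer of tanh units, a sigmoid output, squared-error loss.
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(10000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    # Backward pass: chain rule through the sigmoid, then the tanh layer.
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(np.round(out.ravel(), 2))   # should approach [0, 1, 1, 0]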
Application - combining experts’ advice
Boosting problem
Weak learners
Multiplicative updates
Convergence bounds
Properties
Noise sensitivity
Modifications for robustness (e.g. rho-boost)
Applications - Viola-Jones face detection
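A minimal sketch of boosting with decision stumps as the weak learners: multiplicative weight updates on the training examples and a weighted vote over the stumps (essentially AdaBoost); the toy data and the number of rounds are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)          # toy labels

def fit_stump(X, y, w):
    """Weighted-error-minimizing threshold on a single feature."""
    best = (1.0, 0, 0.0, 1.0)                           # (error, feature, threshold, sign)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            for s in (1.0, -1.0):
                pred = np.where(X[:, f] <= t, s, -s)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, f, t, s)
    return best

def adaboost(X, y, rounds=20):
    w = np.full(len(y), 1.0 / len(y))                   # example weights
    stumps = []
    for _ in range(rounds):
        err, f, t, s = fit_stump(X, y, w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = np.where(X[:, f] <= t, s, -s)
        w *= np.exp(-alpha * y * pred)                  # multiplicative update
        w /= w.sum()
        stumps.append((alpha, f, t, s))
    return stumps

def predict(stumps, X):
    votes = sum(a * np.where(X[:, f] <= t, s, -s) for a, f, t, s in stumps)
    return np.sign(votes)

stumps = adaboost(X, y)
print("training error:", np.mean(predict(stumps, X) != y))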
Application - robot navigation
Linear Dynamical systems
Linear dependence
Linear observation model
Forward smoothing
Inference
Rauch-Tung-Striebel smoother
Parameter inference
Gaussian elimination
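A minimal sketch of forward filtering in a linear dynamical system (a Kalman filter) for the robot-navigation setting: a constant-velocity state observed through noisy position measurements. The matrices and noise levels are illustrative.

import numpy as np

rng = np.random.default_rng(0)
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])     # state transition for (position, velocity)
C = np.array([[1.0, 0.0]])                # only the position is observed
Q = 0.01 * np.eye(2)                      # process noise covariance
R = np.array([[0.25]])                    # observation noise covariance

# Simulate a true trajectory and noisy observations.
x_true = np.array([0.0, 1.0])
obs = []
for _ in range(50):
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    obs.append(C @ x_true + rng.normal(0, np.sqrt(R[0, 0]), size=1))

# Kalman filter: alternate predict and update steps.
mu, S = np.zeros(2), np.eye(2)
for z in obs:
    mu, S = A @ mu, A @ S @ A.T + Q                       # predict
    K = S @ C.T @ np.linalg.inv(C @ S @ C.T + R)          # Kalman gain
    mu = mu + K @ (z - C @ mu)                            # update with the observation
    S = (np.eye(2) - K @ C) @ S

print("estimated state:", np.round(mu, 2), " true state:", np.round(x_true, 2))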
Application - operating a robot
Temporal Difference Learning
Q Learning
Value function
Partially Observed Markov Decision Process
Policy methods
Policy evaluation
Policy iteration
Policy gradient
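A minimal sketch of tabular Q-learning on a tiny chain-world MDP (states in a row, reward at the right end); the environment, learning rate, discount, and exploration schedule are all illustrative.

import numpy as np

rng = np.random.default_rng(0)
N_STATES = 6                            # states 0..5 in a row; reward at state 5
GOAL = N_STATES - 1

def step(s, a):
    """Action 0 moves left, action 1 moves right (clipped at the ends)."""
    s2 = min(max(s + (1 if a == 1 else -1), 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

Q = np.zeros((N_STATES, 2))
alpha, gamma, eps = 0.1, 0.9, 0.1       # learning rate, discount, exploration

def choose_action(s):
    if rng.uniform() < eps:
        return int(rng.integers(2))
    best = np.flatnonzero(Q[s] == Q[s].max())
    return int(rng.choice(best))        # greedy, breaking ties randomly

for episode in range(500):
    s, done = 0, False
    for _ in range(200):                # cap the episode length
        a = choose_action(s)
        s2, r, done = step(s, a)
        # Temporal-difference update towards r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * (0.0 if done else Q[s2].max()) - Q[s, a])
        s = s2
        if done:
            break

print("greedy action per non-terminal state:", Q[:GOAL].argmax(axis=1))  # should be all 1 (right)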
Application - building a startup
Infrastructure
Computers and Clouds
Storage
Fault tolerance
Processing paradigms
MapReduce
Streaming
(key,value) store
Algorithmic templates
Consistent hashing
Distributed hash table
Machine learning
Distributed optimization
Distributed sampling
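A minimal sketch of consistent hashing, one of the algorithmic templates above for spreading keys over machines so that adding or removing a machine only remaps a small fraction of keys; the hash function, machine names, and number of virtual nodes are illustrative.

import bisect
import hashlib

def h(s):
    """Stable 64-bit hash of a string (its position on the ring)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 64)

class ConsistentHash:
    def __init__(self, nodes, replicas=100):
        # Each machine gets several virtual points on the ring for load balance.
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(replicas))
        self.keys = [p for p, _ in self.ring]

    def lookup(self, key):
        # Walk clockwise from the key's position to the next virtual node.
        i = bisect.bisect(self.keys, h(key)) % len(self.keys)
        return self.ring[i][1]

chash = ConsistentHash(["machine-a", "machine-b", "machine-c"])
before = {k: chash.lookup(k) for k in (f"user{i}" for i in range(1000))}
chash = ConsistentHash(["machine-a", "machine-b", "machine-c", "machine-d"])
moved = sum(before[k] != chash.lookup(k) for k in before)
print(f"{moved / 1000:.0%} of keys moved after adding one machine")  # roughly a quarter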
Each team gets to present a poster. Good teams get to present a spotlight, and the six best teams' projects get talks. Make sure you discuss what you're doing, why you're doing it, how it differs from or improves on what's available, and what it is good for.