Syllabus

Introduction to Machine Learning - 10-701/15-781

Overview

Machine learning studies the question “how can we build computer programs that automatically improve their performance through experience?” This includes learning to perform many types of tasks based on many types of experience. For example, it includes robots learning to better navigate based on experience gained by roaming their environments, medical decision aids that learn to predict which therapies work best for which diseases based on data mining of historical health records, and speech recognition systems that learn to better understand your speech based on experience listening to you. This course is designed to give PhD students a thorough grounding in the methods, theory, mathematics and algorithms needed to do research and applications in machine learning. The topics of the course draw from machine learning, classical statistics, data mining, Bayesian statistics and information theory. Students entering the class with a pre-existing working knowledge of probability, statistics and algorithms will be at an advantage, but the class has been designed so that anyone with a strong numerate background can catch up and fully participate. If you are interested in this topic, but are not a PhD student, or are a PhD student not specializing in machine learning, you might consider Roni Rosenfeld’s master’s level course on Machine Learning, 10-601. ## Syllabus ### 1. Introduction to Machine Learning

  • Applications

    • Optical character recognition
    • Bioinformatics
    • Computational advertising
    • Self-driving cars
    • Network security
  • Programming with data

    • Classification, Regression, Annotation
    • Finding unusual events
  • Data

    • Supervised
    • Unsupervised
    • Semi-supervised and transductive inference
  • Basic tools

    • Linear classification, regression
    • Feature maps
    • Trees
    • Instance based classifiers
  • Challenges

    • Model selection, underfitting, overfitting
    • Validation, confidence
    • Explore * exploit * reactive environment

### 2. Basic Statistics

  • Probabilities

    • Dependence, independence, conditional probabilities
    • Bayes rule and Chain rule
    • Paradoxes in measure theory
  • Parameter estimation

    • Maximum Likelihood Estimation (MLE)
    • Maximum a Posteriori Estimation (MAP)
  • Application I - Naive Bayes for spam filtering

    • Discrete features in Naive Bayes
    • Estimating parameters
    • Finite sample size problems
  • Application II - Naive Bayes for fMRI data processing

    • Continuous features in Naive Bayes

### 3. Instance Based Learning

  • Parzen windows

    • Basic idea (smoothing over empirical average)
    • Kernels
  • Model selection

    • Overfitting and underfitting
    • Crossvalidation and leave-one-out estimation
    • Bias-variance tradeoff
    • Curse of dimensionality
  • Watson-Nadaraya estimators

    • Regression
    • Classification
  • Nearest neighbor estimator

    • Limit case via Parzen
    • Fast lookup

### 4. Perceptron

  • Application - Hebbian learning

  • Perceptron

    • Algorithm
    • Convergence proof
    • Properties
  • Kernel trick

    • Basic idea
    • Kernel Perceptron
    • Kernel expansion
  • Kernel examples

5. Support Vector Classification

  • Application - Optical Character Recognition

  • Support Vector Machines

    • Large Margin Classification
    • Optimization Problem
    • Dual Problem
  • Properties

    • Support Vectors
    • Support Vector expansion
  • Soft Margin Classifier

    • Noise tolerance
    • Optimization problem
    • Dual problem

### 6. Kernels

  • Kernel properties

    • Hilbert spaces
    • Convex cone, distances, norms
    • Mercer’s theorem
    • Representer theorem
    • Kernel expansion
    • Kernel trick in general
  • Regularization and features

    • Shotgun approach
    • Regularization
    • Explicit feature map
  • Hash kernels

    • Application - spam filtering
    • Hashing trick

7. Kernels

  • Application - jet engine failure detection

  • Support Vector Novelty Detection

    • Density estimation
    • Thresholding and scaling
    • Optimization problem and dual
    • Online setting
  • Support Vector Regression

    • Loss functions
    • Optimization problem and dual
    • Examples
  • Kernel PCA

    • Principal Component Analysis
    • Kernelization

### 8. Stochastic Convergence

  • Convergence properties

    • Weak convergence
    • Convergence in probability
    • Strong (almost surely)
  • Guarantees

    • Law of large numbers

### 9. Tail bounds

  • Central limit theorem
  • Tail bounds (Markov, Chebyshev, Hoeffding, Bernstein, McDiarmid inequality)
  • Application - A/B testing for page layout

### 10. Risk Minimization

  • Risk and loss

    • Loss functions
    • Risk
    • Empirical risk vs True risk
    • Empirical Risk minimization
  • Underfitting and Overfitting

  • Examples

    • Classification
    • Regression

### 11. Learning Theory

  • Application of McDiarmid’s inequality

  • Infinite many hypothesis

  • Uniform convergence bounds

  • PAC bound for estimation error

  • Structural risk minimzation

  • Vapnik Chervonenkis dimension

    • Shattering
    • Growth Function
    • Sauer’s lemma
    • Vapnik-Chervonenkis inequality
    • Vapnik-Chervonenkis theorem

### 12. Gaussian Processes

  • Motivation - estimating height & weight

  • Gaussian Process

    • Mean function
    • Covariance function
    • Relation to kernels
  • Gaussian Process regression

    • Jointly normal distribution
    • Adding noise
    • Schur complement
  • Conditional Exponential Models

    • Logistic regression
    • Connection to SVMs
    • Poisson regression

### 13. Exponential Families

  • Definition

    • Properties (expectations, covariance)
    • Multinomial, Gauss, Binary, Poisson
    • Principal Component Analysis
  • Inference

    • Conjugate Priors
    • Norm-based priors
    • Maximum likelihood and Maximum-a-posteriori inference
    • Examples - Beta, Dirichlet, Gamma, Wishart

14. Principal Component Analysis

  • Definition

    • PCA algorithms
  • Applications

    • Eigen Faces
    • Image Compression
    • Image Denosing

### 15. Directed Graphical Models

  • Application: Sprinkler, earthquake and burglar

  • Conditional probabilities

    • Joint distribution
    • Parameter estimation for fully observed models
  • Statistical model

    • Conditional independence
    • Effects of observing variables (explaining away etc.)
    • Causal implications

16. Dynamic Programming

  • Application - Route Planning

  • Dynamic programming

    • Problem
    • Basic idea
  • Examples

    • Chains
    • Trees
    • Hypergraphs
    • Loops
  • Generalized Distributive Law

    • Semirings
    • Examples

17. Latent Variable Models

  • Application - clustering users for a web service

  • Variational inference

    • Problem
    • Variational inequality
  • Mixtures of Gaussians

    • Statistical model
    • Variational inequality
    • Expectation Maximization algorithm
  • k-means algorithm

    • Limit case of mixture of Gaussians
    • Simple algorithm
  • Hidden Markov Models

    • Statistical model
    • Variational inequality
    • Dynamic programming (Viterbi and Baum-Welch)

18. Sampling

  • Motivation

    • Intractable distributions
    • Application - stock market, parameter inference
  • Algorithms

    • Rejection sampling
    • Gibbs sampling
    • Metropolis Hastings sampler
    • Markov Chain Monte Carlo (just basic idea)
  • Markov Chains

19. Information Theory

  • Entropy

    • Basic properties, decomposition
    • Kullback-Leibler Divergence (with properties)
    • Mutual Information
  • Examples

    • Exponential families
    • Gaussian mutual information as special case
  • Application - cocktail party problem

    • Independent Component Analysis
    • JADE, Radical, InfoMax, and FastICA (time permitting)

### 20. Decision Trees

  • Application - an engineer discovers machine learning

  • Trees

    • Rules (if then else)
    • Objective functions & criteria for splitting
    • Pruning
  • Fast implementations

    • Acceleration by Chernoff bounds
    • Data distribution

### 21. Neural Networks

  • Application - finding cats on the internet without trying

  • Nonlinear basis functions

    • Optimizing weights
    • Optimizing function classes
  • Multiple layers

    • Hidden layers
    • Backprop
  • Architectures

    • Autoencoder
    • Boltzmann machine
    • Convolutional network

### 22. Boosting

  • Application - combining experts’ advice

  • Boosting problem

    • Weak learners
    • Multiplicative updates
    • Convergence bounds
  • Properties

    • Noise sensitiviy
    • Modifications for robustness (e.g. rho-boost)
  • Applications - Viola-Jones face detection

23. Kalman Filter

  • Application - robot navigation

  • Linear Dynamical systems

    • Linear dependence
    • Linear observation model
    • Forward smoothing
  • Inference

    • Rauch-Tung-Striebel smoother
    • Parameter inference
    • Gauss elimination

24. Reinforcement Learning

  • Application - operating a robot

  • Temporal Difference Learning

  • Q Learning

  • Value function

  • Partially Observed Markov Decision Process

  • Policy methods

    • Policy evaluation
    • Policy iteration
    • Policy gradient

### 25. Scalability

  • Application - building a startup

  • Infrastructure

    • Computers and Clouds
    • Storage
    • Fault tolerance
  • Processing paradigms

    • MapReduce
    • Streaming
    • (key,value) store
  • Algorithmic templates

    • Consistent hashing
    • Distributed hash table
  • Machine learning

    • Distributed optimization
    • Distributed sampling

Final Project Presentations

Each team gets to present a poster. Good teams get to present a spotlight and the six best teams projects get talks. Make sure you discuss what you’re doing, why you’re doing it, in which way it is different or better than what’s available, and what it is good for.