SML: Syllabus

STATISTICS 241B, COMPUTER SCIENCE 281B

Overview

Scalable Machine Learning combines Statistics, Systems, Machine Learning, and Data Mining into flexible, often nonparametric, techniques for analyzing large amounts of data at internet scale. This class aims to teach methods that will power the next generation of internet applications. The class will cover systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real time analytics, spam filtering, topic models, and document analysis.

Lectures

  1. Systems

  2. Basic Statistics

  3. Data Sketches and Streams

  4. Optimization

  5. Generalized Linear Models

  6. Kernels and Regularization

  7. Recommender Systems

  8. Midterm Project Presentations

  9. Graphical Models

  10. Latent Variable Model Templates

  11. Structured Estimation

  12. Large Scale Inference in Graphical Models

  13. Applications

  14. Explore Exploit

  15. Final Project Presentations

Syllabus

1. Systems

  • Hardware

    • Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers

    • Bandwidth, latency and faults

  • Basic parallelization paradigms

    • Trees, stars, rings, queues

    • Hashing (consistent, proportional)

    • Distributed hash tables and P2P

  • Storage

    • RAID

    • Google File System / HDFS

    • Distributed (key, value) storage

  • Processing

    • MapReduce

    • Dryad

    • S4 / stream processing

  • Structured access beyond SQL

    • BigTable

    • Cassandra
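
To make the consistent hashing item above concrete, here is a minimal sketch in Python; the MD5 hash and the number of virtual replicas are illustrative assumptions, not any specific system's implementation:

```python
import bisect
import hashlib

def _h(key):
    # Map a string to a point on the ring [0, 2^32).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << 32)

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes."""

    def __init__(self, nodes, replicas=64):
        self.replicas = replicas
        self.ring = []            # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node owns `replicas` points scattered around the ring.
        for i in range(self.replicas):
            bisect.insort(self.ring, (_h(f"{node}#{i}"), node))

    def lookup(self, key):
        # First ring point clockwise from the key's hash.
        points = [p for p, _ in self.ring]
        i = bisect.bisect(points, _h(key)) % len(self.ring)
        return self.ring[i][1]
```

Because each node owns many small arcs of the ring, adding or removing a node only remaps the keys on its own arcs, which is what makes the scheme attractive for distributed hash tables and (key, value) stores.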

2. Basic Statistics

  • Probabilities

    • Bayes rule

    • Dependence, independence, and conditional probabilities

    • Priors

    • Naive Bayes classifier

  • Tail bounds

    • Chernoff / Hoeffding

    • Chebyshev

    • Gaussian

    • A/B testing

  • Kernel density estimation

    • Parzen windows

    • Nearest neighbors

    • Nadaraya-Watson estimator

  • Exponential families

    • Gaussian, multinomial and Poisson

    • Conjugate distributions and smoothing

    • Integrating out
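
As a concrete bridge from the tail bounds above to A/B testing: for i.i.d. observations in [0, 1], Hoeffding's inequality gives P(|X̄ − μ| ≥ t) ≤ 2·exp(−2nt²), so n ≥ ln(2/δ) / (2t²) samples suffice for tolerance t at confidence 1 − δ. A minimal sketch, with illustrative numbers:

```python
import math

def hoeffding_sample_size(t, delta):
    """Samples needed so that P(|mean - mu| >= t) <= delta,
    for i.i.d. observations bounded in [0, 1]."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * t * t))

# Example: estimate a conversion rate to within 5 percentage points,
# wrong at most 5% of the time.
n = hoeffding_sample_size(t=0.05, delta=0.05)  # 738 samples per variant
```

Note the 1/t² scaling: halving the tolerance quadruples the required traffic, which is why tight A/B tests are expensive.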

3. Data Sketches and Streams

  • Random subsets

    • Hash functions

    • Approximate moments

    • Alon-Matias-Szegedy sketch

  • Heavy hitter detection

    • Lossy counting

    • Space saving

  • Randomized statistics

    • Flajolet-Martin counter

    • Bloom filter and logarithmic randomized techniques

    • CountMin sketch

  • Realtime analytics

    • Fault tolerance and scalability

    • Interpolating sketches

    • Autoregressive predictors
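
A minimal CountMin sketch, to make the randomized-statistics items above concrete; the width/depth values and the BLAKE2 row hashes are illustrative assumptions:

```python
import hashlib
import random

class CountMinSketch:
    """Minimal CountMin sketch over a stream of items."""

    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash column per row, so rows collide independently.
        for row, salt in enumerate(self.salts):
            h = hashlib.blake2b(f"{salt}:{item}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(h, "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Collisions only inflate counters, so the row minimum
        # never underestimates the true count.
        return min(self.table[row][col] for row, col in self._cells(item))
```

Roughly, the overcount is on the order of (total count) / width per row, and taking the minimum over independent rows drives the failure probability down exponentially in the depth; this is the workhorse behind heavy hitter detection over streams.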

4. Optimization

  • Unconstrained problems

    • Gradient descent

    • Newton's method

    • Conjugate gradient descent

    • Broyden-Fletcher-Goldfarb-Shanno (BFGS)

  • Convexity

    • Properties

    • Lagrange function

    • Wolfe dual

  • Batch methods

    • Distributed subgradient

    • Bundle methods

    • Iterative thresholding (ISTA, FISTA)

    • Dual decomposition

  • Online methods

    • Unconstrained subgradient

    • Gradient projections

    • Parallel optimization
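
A minimal sketch of plain gradient descent, the first unconstrained method above; the step size and the toy quadratic objective are illustrative assumptions:

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent: x <- x - lr * grad(x)."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Minimize f(x) = (x0 - 3)^2 + 2*(x1 + 1)^2; the gradient is closed form.
grad = lambda x: [2 * (x[0] - 3), 4 * (x[1] + 1)]
x_star = gradient_descent(grad, [0.0, 0.0])  # converges to (3, -1)
```

For this strongly convex quadratic the iterates contract geometrically toward the minimizer (3, −1); Newton's method would reach it in a single step here, since the Hessian is constant.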

5. Generalized Linear Models

  • Scalar models

    • Principal component analysis

    • Perceptron

    • Regression

    • Classification (soft-margin and logistic loss)

  • Kernel trick

    • Simple kernels

    • Kernel PCA

    • Support Vector Machine classification

    • Regression

    • Logistic regression

    • Novelty detection

  • Optimization

    • Primal space stochastic

    • Primal space batch (dual subgradient)

    • Dual space convex programming

    • Dual space SMO
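
A minimal perceptron sketch for the scalar-model items above (the toy data are made up); its mistake-driven update is also the ancestor of the stochastic primal methods listed under Optimization:

```python
def perceptron(data, epochs=20):
    """Rosenblatt perceptron on (features, label in {-1, +1}) pairs."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:          # mistake: move toward the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Toy linearly separable data.
data = [([2, 2], 1), ([3, 1], 1), ([-1, -2], -1), ([-2, -1], -1)]
w, b = perceptron(data)
```

For linearly separable data the perceptron convergence theorem guarantees a bounded number of mistakes; soft-margin and logistic losses replace the hard mistake indicator with convex surrogates.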

6. Kernels and Regularization

  • Kernels

    • Hilbert Spaces and inner products

    • Mercer's theorem

    • Functional analytic tools (smoothness, Green's functions, Laplace operator)

  • Regularization

    • Operators

    • Sparsity

    • Priors / Gaussian processes

  • Graphs

    • Neighborhood smoothing

    • Graph Laplacian and Label Diffusion

    • Graph kernels

    • Fast inverse parametrization

  • Fast kernel computation

    • Hash kernels

    • String kernels

    • Random kitchen sinks

    • Reduced rank expansions

  • Sparsity

    • l1 regularization

    • Reconstruction guarantees (compressed sensing)

    • FISTA for optimization
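
A minimal sketch of the feature-hashing idea behind hash kernels; the dimension and hash function are illustrative assumptions. Each token is hashed to a coordinate, and a second hash bit provides a random sign so that inner products between hashed vectors stay unbiased:

```python
import hashlib

def hashed_features(tokens, dim=2 ** 10):
    """Feature hashing ('hash kernel'): map tokens into a
    fixed-dimensional vector without storing a dictionary."""
    v = [0.0] * dim
    for tok in tokens:
        h = int(hashlib.blake2b(tok.encode(), digest_size=8).hexdigest(), 16)
        idx = h % dim                       # which coordinate
        sign = 1.0 if (h >> 32) & 1 else -1.0  # sign bit removes collision bias
        v[idx] += sign
    return v
```

The appeal at scale is that the feature space is fixed-size and the map is stateless, so it parallelizes trivially and never needs a vocabulary broadcast.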

7. Recommender Systems

  • Matrix Factorization

    • Singular value decomposition (full and incomplete matrix)

    • Latent Semantic Indexing

    • More loss functions

    • Convex reformulations

  • Ranking and Session Modeling

    • Ordinal regression

    • Set based scores

    • Isotonic scores

    • Session models

  • Features

    • Latent dense (Bayesian Probabilistic Matrix Factorization)

    • Latent sparse (Dirichlet process factorization)

    • Cold-start problem (inferring features)

  • Large scale solvers

    • Alternating optimization

    • Stochastic gradient descent

    • Hashing

    • Parallelization
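
The stochastic gradient solver above can be sketched as SGD on a regularized squared loss over observed ratings; the sizes, learning rate, and toy ratings below are illustrative assumptions:

```python
import random

def factorize(ratings, n_users, n_items, rank=2,
              lr=0.05, reg=0.01, epochs=1000, seed=0):
    """SGD matrix factorization: minimize sum (r - <u_i, v_j>)^2
    plus l2 regularization, one observed rating at a time."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for i, j, r in ratings:
            pred = sum(a * b for a, b in zip(U[i], V[j]))
            err = r - pred
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V
```

Alternating optimization instead freezes one side and solves a least-squares problem for the other, which trades per-update cost for easy parallelization across users or items.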

8. Midterm Project Presentations

Maximum team size is 4, and a typical team should have 3 members. Each team pitches their project to the class for 10 minutes and hands in written documentation of at least 4 and at most 10 pages in a reasonable font size. You should be able to address the following criteria (adapted from Heilmeier's criteria for the purposes of this class). This type of reasoning will help you choose your own research agenda, write grants, convince colleagues, secure VC funding, and write papers.

  • What are you trying to do? Articulate your objectives using absolutely no jargon.

  • How is it done today, and what are the limits of current practice?

  • What's new in your approach and why do you think it will be successful?

  • Who cares? If you're successful, what difference will it make?

  • What are the risks and the payoffs?

  • How long will it take and what have you achieved so far?

  • How will you determine success?

9. Graphical Models

  • Directed Graphical Models

    • Dependence

    • Inference for fully observed models

    • Incomplete information / variational and sampling inference

  • Undirected Graphical Models

    • Hammersley-Clifford decomposition

    • Conditional independence

    • Junction trees

  • Dynamic Programming

    • Generalized Distributive Law

    • Naive Message Passing

  • Inference techniques

    • Sampling (Gibbs and Monte Carlo)

    • Variational methods (EM, extensions, dual decomposition)

10. Latent Variable Model Templates

  • Sets

    • Clusters

    • Trees (and hierarchical clustering)

    • Topics (and mixtures)

    • Supervised variants

  • Chains

    • Markov models

    • Simplicial mixtures

    • Semi-Markov models

  • Factorizations

    • Latent Factor models

    • Hierarchical topic models

    • Recommender systems revisited
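
Clustering, the first template above, can be sketched with Lloyd's k-means algorithm, which is a special case of EM with hard assignments; the initialization scheme and loop counts are illustrative assumptions:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment
    and mean updates until the centers settle."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centers[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centers
```

Replacing the hard assignment with posterior responsibilities gives the EM algorithm for a Gaussian mixture; replacing flat clusters with recursive splits gives hierarchical clustering.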

11. Structured Estimation

  • Conditional Random Fields

    • Sequence annotation

    • Tagging

    • Semi-Markov models / sequence segmentation

  • Max-Margin Estimation

    • Large Margin formulation

    • Scaled Margin variants

    • Approximate margin

    • Dual variants

  • Applications

    • Segmentation

    • Annotation (gene finding, entity tagging)

    • F1/AUC score optimization

    • Ranking

    • Robust and invariant estimation
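
Sequence annotation by dynamic programming can be sketched via Viterbi decoding on a toy HMM; the weather/activity numbers are the standard textbook example, used purely as an illustration:

```python
def viterbi(obs, states, start, trans, emit):
    """Most likely state sequence for an HMM, via dynamic
    programming over the chain (sequence tagging)."""
    # V[s] = probability of the best path ending in state s.
    # (No log-space here for brevity; real implementations use
    # logs to avoid underflow on long sequences.)
    V = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        prev, V, ptr = V, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] * trans[r][s])
            ptr[s] = best
            V[s] = prev[best] * trans[best][s] * emit[s][o]
        back.append(ptr)
    last = max(states, key=lambda s: V[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ("Rainy", "Sunny")
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
path = viterbi(["walk", "shop", "clean"], states, start, trans, emit)
```

Conditional random fields keep the same decoding recursion but learn the transition and emission scores discriminatively; max-margin estimation swaps the log-likelihood for a structured hinge loss.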

12. Large Scale Inference in Graphical Models

  • Variational methods

    • Global variable models (clustering, topics)

    • Variational iterations

    • Bulk synchronous algorithms

  • Message passing

    • Parallel updates

    • Graph coloring

    • Parallel Gibbs Sampler

    • Graphlab

  • Approximations for Scalable Inference

    • Large local state

    • Large global state

    • Out of core storage

    • Asynchronous scheduling

    • Fast sampling (item ordering, heaps, fast proposals, SIMD)

13. Applications

  • Content filtering

    • Mail spam (personalization, hashing)

    • Link spam (graph regularization, stochastic gradient descent)

  • Search and ranking

    • Search using editorial information

  • Advertising

    • Matching ad retrieval

    • Click estimation

    • Supply forecasting

  • Content analysis

    • Entity detection

  • User models

  • Session models

  • Content models

  • Recommendation systems

14. Explore Exploit

  • Bandits

    • UCB

    • EXP3

    • Bayesian bandits

    • Thompson sampling

  • Contextual Bandits

    • Gaussian Process Regression Bandits

    • General Bayesian settings

    • Regret based formulations

  • Applications

    • Computational Advertising

    • Content recommendation
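
The UCB item above can be sketched in a few lines; the Bernoulli arm means, horizon, and seed are illustrative assumptions:

```python
import math
import random

def ucb1(arm_means, horizon=5000, seed=0):
    """UCB1: pull the arm maximizing empirical mean + sqrt(2 ln t / n),
    simulated against Bernoulli arms; returns pull counts per arm."""
    rng = random.Random(seed)
    k = len(arm_means)
    counts, sums = [0] * k, [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                    # play each arm once first
        else:
            arm = max(range(k),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
    return counts
```

The confidence bonus shrinks as an arm is pulled, so exploration tapers off and pulls concentrate on the best arm, giving regret logarithmic in the horizon; Thompson sampling achieves similar behavior by sampling from a posterior instead of adding a bonus.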

15. Final Project Presentations

Each team gives a final presentation of their project to the class. This may be a traditional talk, a demo, a product, an app, or any combination thereof. Make sure you discuss what you're doing, why you're doing it, in what way it is different from or better than what's available, and what it is good for.