Syllabus
Overview
Scalable Machine Learning combines Statistics, Systems, Machine Learning, and Data Mining into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach the methods that will power the next generation of internet applications.
The class covers systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large-scale convex optimization, kernels, graphical models with inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real-time analytics, spam filtering, topic models, and document analysis.
Lectures
- Systems
- Basic Statistics
- Data Sketches and Streams
- Optimization
- Generalized Linear Models
- Kernels and Regularization
- Recommender Systems
- Midterm Project Presentations
- Graphical Models
- Latent Variable Model Templates
- Structured Estimation
- Inference in Graphical Models
- Applications
- Explore/Exploit
- Final Project Presentations
Detailed Syllabus
1. Systems
Hardware
- Processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers
- Bandwidth, latency and faults
Basic parallelization paradigms
- Trees, stars, rings, queues
- Hashing (consistent, proportional)
- Distributed hash tables and P2P
Storage
- RAID
- Google File System / HDFS
- Distributed (key, value) storage
Processing
- MapReduce
- Dryad
- S4 / stream processing
Structured access beyond SQL
- BigTable
- Cassandra
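To make one of the parallelization primitives above concrete, here is a minimal sketch of consistent hashing with virtual nodes. The server names are hypothetical, and a production ring would use a balanced search structure rather than a linear scan for lookups:

```python
import hashlib

def _hash(key: str) -> int:
    # Map a string to a point on the hash ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent hash ring with virtual nodes."""
    def __init__(self, servers, vnodes=8):
        self.ring = sorted(
            (_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key.
        h = _hash(key)
        for point, server in self.ring:
            if point >= h:
                return server
        return self.ring[0][1]  # wrap around the ring

ring = ConsistentHashRing(["cache-a", "cache-b", "cache-c"])
server = ring.lookup("user:42")
```

Because only keys between a removed server's virtual nodes and their predecessors move, adding or removing one server remaps only a small fraction of all keys.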
2. Basic Statistics
Probabilities
- Bayes rule
- Dependence, independence, and conditional probabilities
- Priors
- Naive Bayes classifier
Tail bounds
- Chernoff / Hoeffding
- Chebyshev
- Gaussian
- A/B testing
Kernel density estimation
- Parzen windows
- Nearest neighbors
- Nadaraya-Watson estimator
Exponential families
- Gaussian, multinomial and Poisson
- Conjugate distributions and smoothing
- Integrating out
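As a toy illustration of the Naive Bayes classifier with add-one (Laplace) smoothing, tying Bayes rule to the smoothing topic above — the spam/ham examples are made up:

```python
import math
from collections import Counter

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)             # class counts
        self.counts = {c: Counter() for c in self.classes}
        for words, c in zip(docs, labels):
            self.counts[c].update(words)
        self.vocab = {w for cnt in self.counts.values() for w in cnt}
        return self

    def predict(self, words):
        n_docs = sum(self.prior.values())
        def log_posterior(c):
            total = sum(self.counts[c].values())
            lp = math.log(self.prior[c] / n_docs)
            for w in words:  # add-one smoothing handles unseen words
                lp += math.log((self.counts[c][w] + 1)
                               / (total + len(self.vocab)))
            return lp
        return max(self.classes, key=log_posterior)

clf = NaiveBayes().fit(
    [["cheap", "pills"], ["meeting", "tomorrow"], ["cheap", "meds"]],
    ["spam", "ham", "spam"],
)
label = clf.predict(["cheap", "viagra"])
```

Note that smoothing is what keeps an unseen word like "viagra" from driving the likelihood to zero.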
3. Data Sketches and Streams
Random subsets
- Hash functions
- Approximate moments
- Alon-Matias-Szegedy sketch
Heavy hitter detection
- Lossy counting
- Space saving
Randomized statistics
- Flajolet-Martin counter
- Bloom filter and logarithmic randomized techniques
- CountMin sketch
Realtime analytics
- Fault tolerance and scalability
- Interpolating sketches
- Autoregressive predictors
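A minimal sketch of the CountMin sketch mentioned above: queries never underestimate a count, and hash collisions can only inflate it. The width and depth below are arbitrary small choices:

```python
import hashlib

class CountMinSketch:
    """Count-Min sketch: approximate counts in sublinear space."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _indices(self, item):
        # One (salted) hash function per row.
        for row in range(self.depth):
            h = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(h, 16) % self.width

    def add(self, item, count=1):
        for row, col in self._indices(item):
            self.table[row][col] += count

    def query(self, item):
        # Taking the minimum over rows bounds the damage from collisions.
        return min(self.table[row][col] for row, col in self._indices(item))

cms = CountMinSketch()
for _ in range(100):
    cms.add("heavy-hitter")
cms.add("rare")
```

The returned estimate is always at least the true count, and exceeds it by at most a small additive error with high probability.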
4. Optimization
Unconstrained problems
- Gradient descent
- Newton’s method
- Conjugate gradient descent
- Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Convexity
- Properties
- Lagrange function
- Wolfe dual
Batch methods
- Distributed subgradient
- Bundle methods
- Iterative thresholding (ISTA, FISTA)
- Dual decomposition
Online methods
- Unconstrained subgradient
- Gradient projections
- Parallel optimization
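The simplest of the unconstrained methods above is plain gradient descent; a minimal sketch on a one-dimensional quadratic (the objective is a made-up example):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

On this quadratic each step contracts the error by a constant factor (1 - 2 * lr), so the iterate converges linearly to the minimizer x = 3.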
5. Generalized Linear Models
Scalar models
- Principal component analysis
- Perceptron
- Regression
- Classification (soft-margin and logistic loss)
Kernel trick
- Simple kernels
- Kernel PCA
- Support Vector Machine classification
- Regression
- Logistic regression
- Novelty detection
Optimization
- Primal space stochastic
- Primal space batch (dual subgradient)
- Dual space convex programming
- Dual space SMO
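As a concrete instance of a scalar linear model with stochastic updates, a bare-bones perceptron on made-up, linearly separable data might look like:

```python
def perceptron(data, epochs=10):
    """Train a perceptron: update w, b on every misclassified example."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:  # labels y are in {-1, +1}
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

# Toy separable data: the label is the sign of the first coordinate.
data = [([2.0, 1.0], 1), ([1.5, -1.0], 1),
        ([-2.0, 0.5], -1), ([-1.0, -1.0], -1)]
w, b = perceptron(data)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

Replacing the mistake-driven update with a gradient step on the soft-margin or logistic loss turns this into the stochastic solvers listed above.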
6. Kernels and Regularization
Kernels
- Hilbert Spaces and inner products
- Mercer’s theorem
- Functional analytic tools (smoothness, Green's functions, Laplace operator)
Regularization
- Operators
- Sparsity
- Priors / Gaussian processes
Graphs
- Neighborhood smoothing
- Graph Laplacian and Label Diffusion
- Graph kernels
- Fast inverse parametrization
Fast kernel computation
- Hash kernels
- String kernels
- Random kitchen sinks
- Reduced rank expansions
Sparsity
- l1 regularization
- Reconstruction guarantees (compressed sensing)
- FISTA for optimization
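The "random kitchen sinks" idea (random Fourier features) can be sketched in a few lines: with enough random features, the inner product of the feature maps approximates an RBF kernel with unit bandwidth. The dimensions and test points below are arbitrary:

```python
import math
import random

def random_fourier_features(x, omegas, phases):
    """Feature map z(x) with z(x).z(y) ~= exp(-||x - y||^2 / 2)."""
    d = len(omegas)
    return [math.sqrt(2.0 / d)
            * math.cos(sum(w * xi for w, xi in zip(omega, x)) + b)
            for omega, b in zip(omegas, phases)]

random.seed(0)
dim, n_features = 2, 2000
omegas = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_features)]
phases = [random.uniform(0, 2 * math.pi) for _ in range(n_features)]

x, y = [0.5, -0.2], [0.1, 0.3]
zx = random_fourier_features(x, omegas, phases)
zy = random_fourier_features(y, omegas, phases)
approx = sum(a * b for a, b in zip(zx, zy))
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2)
```

The approximation error shrinks at roughly the rate 1/sqrt(n_features), which is what makes linear solvers on the explicit features a scalable stand-in for kernel methods.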
7. Recommender Systems
Matrix Factorization
- Singular value decomposition (full and incomplete matrix)
- Latent Semantic Indexing
- More loss functions
- Convex reformulations
Ranking and Session Modeling
- Ordinal regression
- Set based scores
- Isotonic scores
- Session models
Features
- Latent dense (Bayesian Probabilistic Matrix Factorization)
- Latent sparse (Dirichlet process factorization)
- Coldstart problem (inferring features)
Large scale solvers
- Alternating optimization
- Stochastic gradient descent
- Hashing
- Parallelization
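A stripped-down version of matrix factorization trained with stochastic gradient descent (the ratings are a tiny made-up example; real solvers add user/item biases and more careful initialization):

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500):
    """SGD on observed entries of R ~= U V^T (squared loss, L2 penalty)."""
    rnd = random.Random(0)
    U = [[rnd.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rnd.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for f in range(k):  # simultaneous update of both factors
                U[u][f], V[i][f] = (
                    U[u][f] + lr * (err * V[i][f] - reg * U[u][f]),
                    V[i][f] + lr * (err * U[u][f] - reg * V[i][f]),
                )
    return U, V

# (user, item, rating) triples for a tiny incomplete matrix.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 1, 5), (2, 2, 1), (2, 0, 5)]
U, V = factorize(ratings, n_users=3, n_items=3)
pred = sum(a * b for a, b in zip(U[0], V[0]))
```

Because each update touches only one row of U and one row of V, the same loop parallelizes naturally over disjoint blocks of users and items.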
8. Midterm Project Presentations
Maximum team size is 4; a typical team should have 3 members. Each team pitches its project to the class for 10 minutes and hands in written documentation of at least 4 and at most 10 pages in a reasonable font size. You should be able to address the following criteria (adapted from Heilmeier’s criteria for the purposes of this class). This type of reasoning will help you choose your own research agenda, write grants, convince colleagues, secure VC funding, and write papers.
- What are you trying to do? Articulate your objectives using absolutely no jargon.
- How is it done today, and what are the limits of current practice?
- What’s new in your approach and why do you think it will be successful?
- Who cares? If you’re successful, what difference will it make?
- What are the risks and the payoffs?
- How long will it take and what have you achieved so far?
- How will you determine success?
9. Graphical Models
Directed Graphical Models
- Dependence
- Inference for fully observed models
- Incomplete information / variational and sampling inference
Undirected Graphical Models
- Hammersley-Clifford decomposition
- Conditional independence
- Junction trees
Dynamic Programming
- Generalized Distributive Law
- Naive Message Passing
Inference techniques
- Sampling (Gibbs and Monte Carlo)
- Variational methods (EM, extensions, dual decomposition)
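As a small example of sampling-based inference, here is a Gibbs sampler for a two-variable Ising-style model, where the exact correlation E[xy] = tanh(J) is available for comparison (the coupling J and sample sizes are arbitrary choices):

```python
import math
import random

def gibbs_ising_pair(J=1.0, n_samples=20000, burn_in=1000, seed=0):
    """Gibbs sampling for p(x, y) proportional to exp(J*x*y), x, y in {-1, +1}."""
    rnd = random.Random(seed)
    x, y = 1, 1
    total = 0
    for t in range(burn_in + n_samples):
        # Resample each variable from its conditional given the other:
        # P(x = +1 | y) = sigmoid(2*J*y), and symmetrically for y.
        x = 1 if rnd.random() < 1 / (1 + math.exp(-2 * J * y)) else -1
        y = 1 if rnd.random() < 1 / (1 + math.exp(-2 * J * x)) else -1
        if t >= burn_in:
            total += x * y
    return total / n_samples

corr = gibbs_ising_pair()  # exact value is tanh(1) ~ 0.762
```

The same resample-one-variable-given-its-neighbors step scales to large undirected models, which is where the parallel variants discussed later come in.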
10. Latent Variable Model Templates
Sets
- Clusters
- Trees (and hierarchical clustering)
- Topics (and mixtures)
- Supervised variants
Chains
- Markov models
- Simplicial mixtures
- Semi-Markov models
Factorizations
- Latent Factor models
- Hierarchical topic models
- Recommender systems revisited
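To make the cluster/mixture template concrete, here is a minimal EM algorithm for a two-component 1-D Gaussian mixture (unit variances are assumed for brevity; the data is synthetic):

```python
import math
import random

def em_gmm_1d(xs, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture with unit variances."""
    mu = [min(xs), max(xs)]          # crude but effective initialization
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibilities r[k] = P(component k | x).
        resp = []
        for x in xs:
            w = [pi[k] * math.exp(-0.5 * (x - mu[k]) ** 2) for k in range(2)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: re-estimate mixing weights and means.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
    return pi, mu

random.seed(0)
xs = ([random.gauss(-3, 1) for _ in range(200)]
      + [random.gauss(3, 1) for _ in range(200)])
pi, mu = em_gmm_1d(xs)
```

Swapping the Gaussian likelihood for a multinomial over words turns the same E-step/M-step template into a basic topic model.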
11. Structured Estimation
Conditional Random Fields
- Sequence annotation
- Tagging
- Semi-Markov models / sequence segmentation
Max-Margin Estimation
- Large Margin formulation
- Scaled Margin variants
- Approximate margin
- Dual variants
Applications
- Segmentation
- Annotation (gene finding, entity tagging)
- F1/AUC score optimization
- Ranking
- Robust and invariant estimation
12. Large Scale Inference in Graphical Models
Variational methods
- Global variable models (clustering, topics)
- Variational iterations
- Bulk synchronous algorithms
Message passing
- Parallel updates
- Graph coloring
- Parallel Gibbs Sampler
- Graphlab
Approximations for Scalable Inference
- Large local state
- Large global state
- Out of core storage
- Asynchronous scheduling
- Fast sampling (item ordering, heaps, fast proposals, SIMD)
13. Applications
Content filtering
- Mail spam (personalization, hashing)
- Link spam (graph regularization, stochastic gradient descent)
Search and ranking
- Search using editorial information
Advertising
- Matching ad retrieval
- Click estimation
- Supply forecasting
Content analysis
- Entity detection
- User models
- Session models
- Content models
- Recommendation systems
14. Explore/Exploit
Bandits
- UCB
- EXP3
- Bayesian bandits
- Thompson sampling
Contextual Bandits
- Gaussian Process Regression Bandits
- General Bayesian settings
- Regret based formulations
Applications
- Computational Advertising
- Content recommendation
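A compact sketch of the UCB1 policy on Bernoulli arms (the click-through rates below are made up; real advertising systems would use the contextual variants listed above):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1: play the arm maximizing empirical mean + sqrt(2 ln t / n_a)."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1                       # initialize: play each arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        sums[arm] += reward
    return counts

# Bernoulli arms with hypothetical click-through rates; arm 2 is best.
rates = [0.1, 0.2, 0.8]
rnd = random.Random(0)
counts = ucb1(lambda a: 1.0 if rnd.random() < rates[a] else 0.0,
              n_arms=3, horizon=2000)
```

The confidence bonus shrinks as an arm is pulled more, so exploration concentrates on arms whose estimates are still uncertain while the empirically best arm is exploited.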
15. Final Project Presentations
Each team gives a final presentation of their project to the class. This may be a traditional talk, a demo, a product, an app, or any combination thereof. Make sure you discuss what you’re doing, why you’re doing it, how it differs from or improves on what’s available, and what it is good for.