## SML: Syllabus

STATISTICS 241B, COMPUTER SCIENCE 281B
## Overview

Scalable Machine Learning combines Statistics, Systems, Machine Learning, and Data Mining into flexible, often nonparametric, and scalable techniques for analyzing large amounts of data at internet scale. This class aims to teach the methods that will power the next generation of internet applications. The class covers systems and processing paradigms, an introduction to statistical analysis, algorithms for data streams, generalized linear methods (logistic models, support vector machines, etc.), large-scale convex optimization, kernels, graphical models and inference algorithms such as sampling and variational approximations, and explore/exploit mechanisms. Applications include social recommender systems, real-time analytics, spam filtering, topic models, and document analysis.
## Syllabus
## 1. Systems
- **Hardware**: processor, RAM, buses, GPU, disk, SSD, network, switches, racks, server centers; bandwidth, latency, and faults
- **Basic parallelization paradigms**: trees, stars, rings, queues; hashing (consistent, proportional); distributed hash tables and P2P
- **Storage**: RAID; Google File System / HadoopFS; distributed (key, value) storage
- **Processing**: MapReduce; Dryad; S4 / stream processing
- **Structured access beyond SQL**: BigTable; Cassandra
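Consistent hashing, listed above under parallelization paradigms, can be sketched in a few lines. The class name `ConsistentHashRing`, the virtual-node count, and the MD5-based hash below are illustrative choices, not an implementation prescribed by the course:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hashing ring with virtual nodes.

    Each key maps to the first server clockwise on the ring, so adding or
    removing a server only remaps the keys in that server's arcs.
    """

    def __init__(self, servers, vnodes=100):
        self._ring = []  # sorted list of (hash, server)
        for s in servers:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{s}#{v}"), s))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        # A stable hash; MD5 is fine here since this is placement, not security.
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def lookup(self, key):
        h = self._hash(key)
        # First virtual node clockwise from h, wrapping around the ring.
        idx = bisect.bisect(self._ring, (h, "")) % len(self._ring)
        return self._ring[idx][1]
```

Because only the departing server's arcs change, removing one server leaves every key that was mapped to the remaining servers in place, which is exactly the property that makes this scheme attractive for distributed (key, value) storage.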
## 2. Basic Statistics
- **Probabilities**: Bayes rule; dependence, independence, conditional probabilities; priors; naive Bayes classifier
- **Tail bounds**: Chernoff / Hoeffding; Chebyshev; Gaussian; A/B testing
- **Kernel density estimation**: Parzen windows; nearest neighbors; Watson-Nadaraya estimator
- **Exponential families**: Gaussian, multinomial, and Poisson; conjugate distributions and smoothing; integrating out
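As a taste of how tail bounds drive A/B testing, Hoeffding's inequality for i.i.d. observations bounded in [0, 1] gives P(|empirical mean - true mean| >= eps) <= 2 exp(-2 n eps^2); solving for eps or n yields confidence half-widths and sample-size requirements. A minimal sketch (the function names are my own):

```python
import math

def hoeffding_epsilon(n, delta):
    """Half-width eps such that, by Hoeffding's inequality,
    P(|empirical mean - true mean| >= eps) <= delta
    for n i.i.d. observations bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def samples_needed(eps, delta):
    """Smallest n guaranteeing the Hoeffding bound for given eps and delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
```

For example, detecting a 1% difference in click-through rate at 95% confidence already requires `samples_needed(0.01, 0.05)`, i.e. tens of thousands of observations per bucket, which is why A/B testing at internet scale is a systems problem as much as a statistics problem.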
## 3. Data Sketches and Streams
- **Random subsets**: hash functions; approximate moments; Alon-Matias-Szegedy sketch
- **Heavy hitter detection**: lossy counting; space saving
- **Randomized statistics**: Flajolet counter; Bloom filter and logarithmic randomized techniques; CountMin sketch
- **Realtime analytics**: fault tolerance and scalability; interpolating sketches; autoregressive predictors
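The CountMin sketch mentioned above fits in a short class. This is a sketch under simplifying assumptions (Python's built-in `hash` with per-row salts instead of a proper pairwise-independent family; fixed small width and depth):

```python
import random

class CountMinSketch:
    """Count-Min sketch: approximate counts of stream items in sublinear space.

    Each item is hashed into one cell per row; queries take the minimum over
    rows, so estimates never undercount, and the standard analysis bounds the
    overcount by eps * (total mass) with probability 1 - delta for
    width ~ e / eps and depth ~ ln(1 / delta).
    """

    def __init__(self, width=272, depth=5, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(32) for _ in range(depth)]  # one hash per row

    def _hash(self, item, row):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._hash(item, r)] += count

    def query(self, item):
        # The minimum over rows never undercounts the true frequency.
        return min(self.table[r][self._hash(item, r)] for r in range(self.depth))
```

Combined with a small heap over the current estimates, the same structure yields a heavy-hitter detector.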
## 4. Optimization
- **Unconstrained problems**: gradient descent; Newton's method; conjugate gradient descent; Broyden-Fletcher-Goldfarb-Shanno (BFGS)
- **Convexity**: properties; Lagrange function; Wolfe dual
- **Batch methods**: distributed subgradient; bundle methods; iterative thresholding (ISTA, FISTA); dual decomposition
- **Online methods**: unconstrained subgradient; gradient projections; parallel optimization
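The simplest of the methods above, plain gradient descent, can be sketched in a few lines; the function below and its test problem are illustrative, not course code:

```python
def gradient_descent(grad, x0, lr=0.1, steps=1000):
    """Plain gradient descent: x_{t+1} = x_t - lr * grad(x_t).

    Converges on smooth convex objectives when lr is below 2 / L,
    where L is the Lipschitz constant of the gradient.
    """
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x

# Minimize f(x) = (x0 - 3)^2 + 2 * (x1 + 1)^2; gradient is (2(x0-3), 4(x1+1)).
minimum = gradient_descent(lambda x: [2 * (x[0] - 3), 4 * (x[1] + 1)], [0.0, 0.0])
```

On this quadratic the iterates contract toward the minimizer (3, -1) by a constant factor per step; Newton's method, conjugate gradients, and BFGS all improve on this by exploiting curvature.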
## 5. Generalized Linear Models

- **Scalar models**: principal component analysis; perceptron; regression; classification (soft-margin and logistic loss)
- **Kernel trick**: simple kernels; kernel PCA; support vector machine classification; regression; logistic regression; novelty detection
- **Optimization**: primal-space stochastic; primal-space batch (dual subgradient); dual-space convex programming; dual-space SMO
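The perceptron listed above is the prototypical scalar linear model and fits in a dozen lines; this is a minimal sketch with a hypothetical `perceptron` helper, assuming labels in {-1, +1}:

```python
def perceptron(data, epochs=100):
    """Perceptron: on a mistake for example (x, y), update w += y * x, b += y.

    Converges in finitely many mistakes whenever the data are linearly
    separable (perceptron convergence theorem)."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # mistake, or exactly on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            break
    return w, b
```

Replacing the mistake-driven update with a subgradient step on the soft-margin or logistic loss turns this same loop into the primal-space stochastic solvers listed above.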
## 6. Kernels and Regularization

- **Kernels**: Hilbert spaces and inner products; Mercer's theorem; functional analytic tools (smoothness, Green's functions, Laplace operator)
- **Regularization**: operators; sparsity; priors / Gaussian processes
- **Graphs**: neighborhood smoothing; graph Laplacian and label diffusion; graph kernels; fast inverse parametrization
- **Fast kernel computation**: hash kernels; string kernels; random kitchen sinks; reduced-rank expansions
- **Sparsity**: l1 regularization; reconstruction guarantees (compressed sensing); FISTA for optimization
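The l1 items above come together in ISTA for the lasso, 0.5 ||Ax - b||^2 + lam ||x||_1: a gradient step on the quadratic part followed by soft-thresholding. A minimal dense sketch (the function names are my own; FISTA adds a momentum term on top of this loop):

```python
def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1, applied elementwise:
    shrink each entry toward zero by t, clipping at zero."""
    return [max(abs(vi) - t, 0.0) * (1 if vi > 0 else -1) for vi in v]

def ista(A, b, lam, lr, steps=500):
    """ISTA for the lasso 0.5 * ||A x - b||^2 + lam * ||x||_1.

    Alternates a gradient step on the smooth quadratic with
    soft-thresholding at level lam * lr; lr must be at most
    1 / ||A^T A|| for guaranteed convergence."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(steps):
        r = [sum(aij * xj for aij, xj in zip(row, x)) - bi
             for row, bi in zip(A, b)]                       # residual A x - b
        g = [sum(A[i][j] * r[i] for i in range(len(A)))      # gradient A^T r
             for j in range(n)]
        x = soft_threshold([xj - lr * gj for xj, gj in zip(x, g)], lam * lr)
    return x
```

The thresholding step is what produces exact zeros in the solution, which is the mechanism behind the sparsity and compressed-sensing guarantees listed above.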
## 7. Recommender Systems

- **Matrix factorization**: singular value decomposition (full and incomplete matrix); latent semantic indexing; more loss functions; convex reformulations
- **Ranking and session modeling**: ordinal regression; set-based scores; isotonic scores; session models
- **Features**: latent dense (Bayesian probabilistic matrix factorization); latent sparse (Dirichlet process factorization); cold-start problem (inferring features)
- **Large-scale solvers**: alternating optimization; stochastic gradient descent; hashing; parallelization
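The stochastic gradient solver for matrix factorization can be sketched directly: approximate each observed rating r_ui by the inner product of a user factor p_u and an item factor q_i, and update both along the gradient of the regularized squared error. The function below and its hyperparameters are illustrative defaults, not the course's reference solver:

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02,
              epochs=500, seed=0):
    """SGD matrix factorization on (user, item, rating) triples.

    For each observation, take a gradient step on
    (r - <p_u, q_i>)^2 + reg * (||p_u||^2 + ||q_i||^2).
    """
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q
```

Alternating optimization replaces the inner loop with an exact least-squares solve for one factor matrix at a time; hashing and asynchronous updates are what make either variant scale.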
## 8. Midterm Project Presentations
Maximum team size is 4, and a typical team should have 3 members.
Each team gets to pitch their project to the class for 10 minutes and hand in written documentation of at least 4 and at most 10 pages in a reasonable font size. You should be able to address the following criteria (adapted from Heilmeier's criteria for the purpose of this class). This type of reasoning will help you with choosing your own research agenda, writing grants, convincing colleagues, securing VC funding, and writing papers.
- **What** are you trying to do? Articulate your objectives using absolutely no jargon.
- **How** is it done today, and what are the limits of current practice?
- What's **new** in your approach and **why** do you think it will be successful?
- **Who** cares? If you're successful, what difference will it make?
- What are the **risks** and the **payoffs**?
- **How long** will it take and what have you achieved so far?
- How will you **determine success**?
## 9. Graphical Models

- **Directed graphical models**: dependence; inference for fully observed models; incomplete information / variational and sampling inference
- **Undirected graphical models**: Hammersley-Clifford decomposition; conditional independence; junction trees
- **Dynamic programming**: generalized distributive law; naive message passing
- **Inference techniques**: sampling (Gibbs and Monte Carlo); variational methods (EM, extensions, dual decomposition)
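The generalized distributive law is easiest to see on a chain: the marginal of the last variable can be computed by passing a message forward, in time linear in the chain length, instead of summing over exponentially many joint configurations. A minimal sketch with hypothetical function names, comparing both computations on the same unnormalized potentials:

```python
from itertools import product

def chain_marginal(unary, pairwise):
    """Marginal of the last variable in a chain MRF via forward message
    passing (sum-product / generalized distributive law).

    unary[t][s] is the unary potential of state s at node t;
    pairwise[t][i][j] couples state i at node t with state j at node t+1.
    """
    msg = unary[0]
    for t in range(1, len(unary)):
        msg = [sum(msg[i] * pairwise[t - 1][i][j] for i in range(len(msg)))
               * unary[t][j]
               for j in range(len(unary[t]))]
    z = sum(msg)
    return [m / z for m in msg]

def brute_force_marginal(unary, pairwise):
    """Same marginal by explicit summation over all joint configurations."""
    n = len(unary)
    scores = [0.0] * len(unary[-1])
    for cfg in product(*[range(len(u)) for u in unary]):
        w = 1.0
        for t, s in enumerate(cfg):
            w *= unary[t][s]
        for t in range(n - 1):
            w *= pairwise[t][cfg[t]][cfg[t + 1]]
        scores[cfg[-1]] += w
    z = sum(scores)
    return [s / z for s in scores]
```

Junction trees generalize exactly this trick from chains to arbitrary graphs by choosing a clique ordering along which messages can be passed.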
## 10. Latent Variable Model Templates

- **Sets**: clusters; trees (and hierarchical clustering); topics (and mixtures); supervised variants
- **Chains**: Markov models; simplicial mixtures; semi-Markov models
- **Factorizations**: latent factor models; hierarchical topic models; recommender systems revisited
## 11. Structured Estimation

- **Conditional random fields**: sequence annotation; tagging; semi-Markov models / sequence segmentation
- **Max-margin estimation**: large-margin formulation; scaled-margin variants; approximate margin; dual variants
- **Applications**: segmentation; annotation (gene finding, entity tagging); F1/AUC score optimization; ranking; robust and invariant estimation
## 12. Large Scale Inference in Graphical Models

- **Variational methods**: global variable models (clustering, topics); variational iterations; bulk synchronous algorithms
- **Message passing**: parallel updates; graph coloring; parallel Gibbs sampler; GraphLab
- **Approximations for scalable inference**: large local state; large global state; out-of-core storage; asynchronous scheduling; fast sampling (item ordering, heaps, fast proposals, SIMD)
## 13. Applications

- **Content filtering**: mail spam (personalization, hashing); link spam (graph regularization, stochastic gradient descent)
- **Search and ranking**: search using editorial information
- **Advertising**: matching ad retrieval; click estimation; supply forecasting
- **Content analysis**: entity detection
- **User models**: session models; content models; recommendation systems
## 14. Explore/Exploit

- **Bandits**: UCB; EXP3; Bayesian bandits; Thompson sampling
- **Contextual bandits**: Gaussian process regression bandits; general Bayesian settings; regret-based formulations
- **Applications**: computational advertising; content recommendation
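UCB1, the first algorithm in the list above, makes the explore/exploit trade-off explicit: play the arm maximizing empirical mean plus an optimism bonus sqrt(2 ln t / n_pulls), so rarely played arms get tried and good arms get exploited. A minimal sketch (the `ucb1` signature and the seeded reward function in the usage are my own illustration):

```python
import math
import random

def ucb1(pull, n_arms, horizon):
    """UCB1 for rewards in [0, 1].

    At each round, play the arm maximizing
    empirical mean + sqrt(2 * ln t / n_pulls),
    after playing every arm once to initialize the estimates.
    `pull(arm)` returns the observed reward.
    """
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    history = []
    for t in range(1, horizon + 1):
        if t <= n_arms:                      # initialization: one pull per arm
            arm = t - 1
        else:
            arm = max(range(n_arms),
                      key=lambda a: sums[a] / counts[a]
                      + math.sqrt(2.0 * math.log(t) / counts[a]))
        r = pull(arm)
        counts[arm] += 1
        sums[arm] += r
        history.append(arm)
    return counts, sums, history
```

The bonus shrinks like sqrt(ln t / n), so each suboptimal arm is pulled only O(ln t) times, which is the logarithmic-regret guarantee that makes bandits practical for ad serving and content recommendation.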
## 15. Final Project Presentations
Each team gets to give a final presentation of their project to the class. This may be a traditional talk, a demo, a product, an app, or any combination thereof. Make sure you discuss