Scaling Machine Learning

This tutorial discusses machine learning and systems aspects of big data analytics. In particular, we will give an overview of modern distributed processing systems and sources of large amounts of data. We will discuss the parameter server concept in considerable detail and then show how it can be applied to a variety of large-scale learning problems, ranging from terascale convex optimization and topic modeling over billions of users to generative models for mixed datatypes such as geotagged microblogs, and factorization and recommender systems.

Systems and Data

Data at internet scale does not exist in isolation. Rather, it arises as a product of services such as search, advertising, user-generated content, rich media content, or location data. This means that data is inherently tied to the form in which it is stored, distributed, and made accessible across many machines in a data center. We will give an overview of the following systems:

  • Hadoop and MapReduce: Over a decade old, this captures one of the first strategies for distributed data processing. Emphasis is placed on the ability to process data with little need for communication beyond episodic exchanges. Despite the simplicity of its API, it offers efficient implementations of many batch-synchronous data analysis algorithms.

  • Spark: In many cases moderate amounts of data can be held in distributed memory for processing. This is the key idea in Spark, and efficient machine learning algorithms can be built on it as a platform; a minimal sketch follows this list.

  • Dryad: This is one of the first dataflow and graph processing paradigms.
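
To make the distributed-memory idea concrete, here is a minimal PySpark sketch that caches a dataset and runs a few batch-synchronous gradient steps for logistic regression; the file path, feature dimension, and stepsize are illustrative assumptions rather than material from the tutorial.

    # Minimal PySpark sketch: data cached in distributed memory, synchronous
    # gradient steps driven from the driver program. Path, dimension, and
    # stepsize are illustrative only.
    from pyspark import SparkContext
    import numpy as np

    sc = SparkContext(appName="distributed-gradient-sketch")

    def parse(line):
        # Each line: label in {-1, +1} followed by whitespace-separated features.
        parts = [float(v) for v in line.split()]
        return np.array(parts[1:]), parts[0]

    data = sc.textFile("hdfs:///path/to/training_data").map(parse).cache()

    w = np.zeros(100)                  # assumed feature dimension
    for _ in range(10):                # a few synchronous gradient steps
        grad = data.map(lambda xy: -xy[1] * xy[0]
                        / (1.0 + np.exp(xy[1] * np.dot(w, xy[0])))) \
                   .reduce(lambda a, b: a + b)
        w -= 0.1 * grad                # fixed stepsize, for illustration only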

Besides these three tools, a multitude of other systems has sprung up, such as S4, Reef, and Storm. It is also useful to consider some prototypical applications in greater detail.

  • Computational Advertising: One of the core primitives of the digital economy, computational advertising crucially relies on the ability of machine-learned systems to accurately estimate the probability of a user interacting with an ad in a meaningful manner. Scalable binary classification arises as a central tool in this context.

  • Recommendation: In many cases the set of options available to a user is considerably larger than what can practically be presented. For instance, when recommending movies it is infeasible to offer a choice among tens of thousands of items. Instead, only a small subset is selected for inspection by the user. The situation is analogous for social recommendation, search, and the design of adaptive user interfaces.

  • Profiling: A key tool in recommendation is the ability to understand a user's desires, preferences, and needs. While explicit elicitation by queries is biased and disruptive for the user, adaptive preference extraction can work behind the scenes using the behavioral trail. Such insight is crucial for accurate recommendation and advertising.

The Parameter Server

When designing data analysis algorithms, an important challenge is to balance the need for flexibility and generality in machine learning algorithms against the simplicity of systems design. A convenient extension beyond batch-synchronous designs is a bipartite graph in which clients and servers exchange a partially shared state.

This design, usually referred to as a parameter server, has proven successful in scaling to very large problems over the past five years. In the tutorial we will discuss a particular variant, as implemented on Parameterserver.org, that allows one to scale machine learning systems to hundreds of billions of samples and thousands of machines. It is an open source project hosted at CMU with input from Baidu IDL and Google. Some of its key features are as follows:

  • Efficient communication: All communication is asynchronous. It is optimized for machine learning tasks to reduce network traffic and overhead.

  • Flexible consistency models: The system provides flexible consistency models to allow the algorithm designer to balance algorithmic convergence rate and system efficiency, where the best trade-off depends on data, algorithm, and hardware.

  • Elastic Scalability: New nodes can be added without restarting the running framework.

  • Fault Tolerance and Durability: Recovery from and repair of non-catastrophic machine failures within several seconds, without interrupting computation.

  • Ease of Use: The globally shared parameters are represented as (potentially sparse) vectors and matrices to facilitate the development of machine learning applications. These data types come with high-performance multi-threaded linear algebra libraries.

The tutorial will discuss a number of problems and how they can be mapped into the parameter server framework.
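
To make the mapping concrete, the sketch below shows a schematic worker that repeatedly pulls the parameter entries touched by its local data shard, computes a local logistic-loss gradient, and pushes an update back to the servers. The client object and its push/pull signatures are hypothetical stand-ins, not the actual API of the open source implementation.

    # Schematic parameter-server worker loop. The client's push/pull interface
    # is a hypothetical stand-in for illustration.
    import numpy as np

    def worker_loop(client, shard, iterations=100, eta=0.1):
        """shard: list of (x, y), x a sparse dict {feature index: value}, y in {-1, +1}."""
        for _ in range(iterations):
            # Pull only the weight entries touched by the local shard (sparse access).
            keys = sorted({j for x, _ in shard for j in x})
            w = client.pull(keys)                  # assumed to return {key: weight}

            # Local gradient of the logistic loss over this worker's shard.
            grad = {k: 0.0 for k in keys}
            for x, y in shard:
                margin = y * sum(w[j] * v for j, v in x.items())
                scale = -y / (1.0 + np.exp(margin))
                for j, v in x.items():
                    grad[j] += scale * v

            # Push the scaled update; servers aggregate asynchronously across workers.
            client.push(keys, [-eta * grad[k] for k in keys])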

Applications

To illustrate the use of the tools listed above, we discuss some applications in somewhat greater detail:

  • Time-dependent User Profiling: When recommending items to users it is vital to understand what their preferences are and how they might be changing over time. Moreover, it is equally vital to understand whether and how global changes in interest and preference might be affecting individual activity. We give an overview of an integrated model and show how to perform inference at industrial scale using the parameter server framework.

  • Nested Chinese Restaurant Franchise: In many cases a simple flat topic model is inadequate for modeling preferences. For instance, when it comes to capturing the location and content of microblogs, it is preferable to have the flexibility of capturing location at multiple levels of detail. The same also applies to content preferences (e.g. general photography vs. users of a specific lens system). This requires distributions over trees of topics; a minimal sketch of such a tree prior follows this list. We give an overview of the associated generative model and how to perform sampling inference in it.
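
As a minimal illustration of a distribution over topic trees, the sketch below draws per-document paths from a plain nested Chinese restaurant process prior; the franchise variant discussed in the tutorial adds sharing across groups, and the depth and concentration parameter here are purely illustrative.

    # Sketch of the plain nested Chinese restaurant process prior over topic trees.
    # The "franchise" variant adds sharing across groups; gamma and depth are
    # illustrative assumptions.
    import random

    def sample_path(root, depth=3, gamma=1.0):
        """At each level, pick an existing child with probability proportional to
        its customer count, or open a new child with probability proportional to gamma."""
        node, path = root, []
        for _ in range(depth):
            children = node["children"]
            weights = [c["count"] for c in children] + [gamma]
            i = random.choices(range(len(weights)), weights=weights)[0]
            if i == len(children):                 # open a new subtopic
                children.append({"count": 0, "children": []})
            node = children[i]
            node["count"] += 1
            path.append(i)
        return path

    root = {"count": 0, "children": []}
    paths = [sample_path(root) for _ in range(5)]  # tree paths for five documents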

Distributed Optimization

A key primitive in solving large-scale problems, both for risk minimization and for probabilistic modeling, is optimization. Based on two examples, we show how this can be achieved efficiently at scale.

  • Sparse Logistic Regression: This is one of the simplest cases of nontrivial optimization. It is one of the key drivers of many computational advertising platforms, since compact models immediately translate into higher throughput and are therefore highly desirable. We show how this can be implemented efficiently using a parameter server on half a petabyte of data; a toy single-machine sketch of the objective follows this list.

  • Distributed Stochastic Variational Inference: For probabilistic modeling matters are slightly less trivial, since the problem is typically nonconvex, gradients are stochastic, and distributed initialization is challenging. By combining fast samplers and adaptive stepsize dynamics, we show how this can be accomplished on over 4 billion user profiles.
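
As a toy illustration of the sparse logistic regression objective mentioned above, the following single-machine sketch runs proximal gradient (soft-thresholding) steps on synthetic data; it is a pedagogical stand-in, not the distributed algorithm presented in the tutorial.

    # Toy proximal-gradient solver for L1-regularized logistic regression:
    # minimize sum_i log(1 + exp(-y_i <w, x_i>)) + lam * ||w||_1.
    # Single-machine sketch on synthetic data, for illustration only.
    import numpy as np

    def soft_threshold(w, t):
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    def sparse_logreg(X, y, lam=0.1, eta=0.01, iters=200):
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(iters):
            margins = y * (X @ w)
            grad = X.T @ (-y / (1.0 + np.exp(margins)))    # logistic-loss gradient
            w = soft_threshold(w - eta * grad, eta * lam)  # proximal step yields sparsity
        return w

    # Illustrative synthetic data: only the first five features matter.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))
    w_true = np.zeros(50)
    w_true[:5] = 1.0
    y = np.sign(X @ w_true + 0.1 * rng.normal(size=1000))
    w_hat = sparse_logreg(X, y)                            # recovers a sparse weight vector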