Here is some of the more theoretical work that I have pursued over the past decade. The beauty of machine learning is that it can be fairly application agnostic. That is, many seemingly diverse problems, e.g. activity recognition in videos, the segmentation of documents into paragraphs, optical character recognition, and the detection of cancer, share very similar underlying technology. Obviously it needs adjustment for each application, but the basic ideas tend to be fairly universal.


Much of estimation is nonstandard. That is, it goes beyond a simple binary classification or least-mean-squares regression problem. At the same time, many of these problems fall into fairly prototypical categories according to how the standard setting is changed. I have been working on ways to allow for flexible model adjustment. This includes the following techniques:


Kernels are an effective means of dealing with similarity between observations and of describing classes of smooth functions. Obviously there exist other function spaces, but kernels offer the significant advantage of providing a concise description of the evaluation functional. This makes it easy for practitioners to design meaningful kernels. My work includes the following:

Large Scale Optimization

When solving large scale problems with hundreds of millions of observations, issues such as parallelization of computation and storage become relevant. This means that we need to design algorithms which can be stopped at almost any time, which scale gracefully with the number of computers available, and which keep data local. Bundle methods are one such set of methods. This is a fairly exciting area of research at the moment, in particular with the advent of multicore processors and fast graphics cards.
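To make the three requirements concrete, here is a minimal sketch in Python with NumPy. It is not a bundle method: plain gradient descent on a ridge regression objective is swapped in for brevity, and the function names, the squared loss, the synthetic data and the parameter values are all assumptions chosen for illustration. Each shard only ever returns a gradient, so the data stays where it is, and the current iterate is a usable solution whenever the loop is stopped.

```python
import numpy as np


def local_gradient(w, features, labels):
    """Gradient of the summed squared loss on one shard (hypothetical helper)."""
    residual = features @ w - labels
    return features.T @ residual


def sharded_gradient_descent(shards, dim, reg=0.1, step=0.1, iterations=200):
    """Minimize (1/2n) * sum of squared errors + (reg/2) * ||w||^2.

    Only per-shard gradients are combined; in a real deployment each call to
    local_gradient would run on the machine that stores the shard.
    """
    n_total = sum(len(labels) for _, labels in shards)
    w = np.zeros(dim)
    for _ in range(iterations):
        # Aggregate the shard gradients and normalize by the total sample size.
        grad = sum(local_gradient(w, f, l) for f, l in shards) / n_total
        w -= step * (grad + reg * w)
        # w is a valid (if suboptimal) solution here, so we could stop early.
    return w


# Synthetic data, split into four shards to mimic four machines.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=300)
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

w = sharded_gradient_descent(shards, dim=3)
```

Because the aggregated gradient is just a sum over shards, the result does not depend on how the data is partitioned, which is what allows the number of machines to vary.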

Distribution Representation

Often scientists use information theory when dealing with distributions. This is very reasonable when working on information theoretic problems but not necessarily the best idea when it comes to statistical estimation such as risk minimization.

What is interesting about a distribution is often only its behavior when taking expectations. That is, when we are only interested in averages, we can group distributions into equivalence classes of distributions with the same mean. From the point of view of the application they do not differ. This allows us to compute distances between distributions in terms of distances between averages. It turns out that there are necessary and sufficient representations in this context. In particular, averages in a Hilbert space are quite useful.

We have developed a framework of estimation based on this which includes density estimation, clustering, feature selection, two sample tests, independence tests, nonparametric sorting, and low-dimensional data representations. The advantage is that a large number of existing methods appear in this framework as special cases.
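As an illustration of measuring the distance between two distributions through their averages in a Hilbert space, the following is a minimal sketch in Python with NumPy of a two-sample statistic. It assumes a Gaussian kernel and uses the simple biased estimate of the squared distance between the two sample means in feature space. The function names, the bandwidth and the synthetic samples are assumptions for illustration, not the implementation used in the work described here.

```python
import numpy as np


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel matrix between rows of x (n, d) and rows of y (m, d)."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Guard against tiny negative values caused by floating point rounding.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mean_embedding_distance_sq(x, y, bandwidth=1.0):
    """Biased estimate of the squared distance between the kernel means.

    ||mean_x - mean_y||^2 expands into three averages of kernel values.
    """
    k_xx = gaussian_kernel(x, x, bandwidth)
    k_yy = gaussian_kernel(y, y, bandwidth)
    k_xy = gaussian_kernel(x, y, bandwidth)
    return k_xx.mean() + k_yy.mean() - 2.0 * k_xy.mean()


rng = np.random.default_rng(0)
sample_a = rng.normal(0.0, 1.0, size=(200, 2))
sample_b = rng.normal(0.0, 1.0, size=(200, 2))  # same distribution as sample_a
shifted = rng.normal(2.0, 1.0, size=(200, 2))   # mean moved away from sample_a

same_dist = mean_embedding_distance_sq(sample_a, sample_b)
diff_dist = mean_embedding_distance_sq(sample_a, shifted)
```

With these samples, `diff_dist` should come out clearly larger than `same_dist`. A two-sample test would compare such a statistic against a threshold, which this sketch does not compute.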