## Abstracts## Graphical Models for Point Pattern MatchingT. Caelli, T. Caetano, D. Schuurmans Graph matching plays a key role in many areas of computing from computer vision to networks where there is a need to determine correspondences between the components (vertices and edges) of two attributed structures. In recent years three new approaches to graph matching have emerged as replacements to more traditional heuristic methods. These new methods are: (1) least squares - where the optimal correspondence in determined in terms of deriving the the best fitting permutation matrix between sets; (2) spectral methods - where optimal correspondences are derived via subspace projections in the graph eigenspaces; (3) graphical models - where algorithms such as the junction tree algorithm are used to infer the optimal labelling of the nodes of one graph in terms of the the other and that satisfy similarity constraints between vertices and edges. In this lecture we review and compare these methods and demonstrate examples where this applies to point set and line matching. ## Linear models for structure predictionF. Perreira, A. Bernal, K. Crammer, R. McDonald Over the last few years, several groups have been developing models and algorithms for learning to predict the structure of complex data, sequences in particular, that extend well-known linear classification models and algorithms. These methods combine the advantages of discriminative learning with those of probabilistic generative models like HMMs. I review linear models for structure prediction and their simplest learning algorithms, and exemplify their benefits with applications to information extraction from biomedical text and gene finding. ## Sparse Multi Gaussian Process MethodsM. Seeger In order to move beyond “single process” models for binary classification and univariate regression, it is necessary to develop efficient approximations to inference and learning in models which make use of multiple dependent processes. Seemingly a simple extension from the single process case, several issues of representation and approximate computation become much harder in this case. We point out the difficulties and develop a general framework based on the Informative Vector Machine. We give several examples of how this framework can be applied to problems of multi-way and multi-label classification and to conditional sequence or more general random field models. Partly joint work with Michael Jordan and Yee-Whye Teh. ## Decentralized Detection and Marginalized KernelsX. Nguyen, M. J. Wainwright and M. I. Jordan Many real-world applications require a decentralized approach, since aggregating all data at a central location may be infeasible due to communication constraints (e.g., power limitations in sensor networks). We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm that exploits the notion of a marginalized kernel, which has recently proved useful in deriving kernels from generative probablistic models. We show that our algorithm of learning in a decentralized decision-making system amounts to learning a marginalized kernel, where the marginalization is taken over sensor messages conditioned on the sensor observations. The distributed constraints allow for efficient computation in much the same way as the factorized representation of graphical models. We analyze the computational and statistical properties of our algorithm, and demonstrate its performance on both simulated and real sensor network data. ## Graphical Object Models for Detection and TrackingL. Sigal We present a probabilistic framework for component-based automatic detection and tracking of objects in images and/or video. We represent objects as spatio-temporal two-layer graphical models, where each node corresponds to an object or component of an object at a given time, and the edges correspond to learned spatial and temporal constraints. Object detection and tracking is formulated as inference over a directed loopy graph, and is solved with non-parametric belief propagation. This type of object model allows object-detection to make use of temporal consistency (over an arbitrary size temporal window), and facilitates robust tracking of the object. The two layer structure of the graphical model allows inference over the the entire object as well as individual components. AdaBoost detectors are used to define the likelihood and form proposal distributions for components. Proposal distributions provide ‘bottom-up’ information that is incorporated into the inference process; this permits automatic object detection and aids recovery from transient tracking failures. We illustrate our method by detecting and tracking two classes of objects, vehicles and pedestrians, in video sequences collected using a single grayscale uncalibrated car-mounted moving camera. ## A Family of Probabilistic Kernels Based on Information DivergenceA. B. Chan, N. Vasconcelos, and P. J. Moreno Probabilistic kernels offer a way to combine generative models with discriminative classifiers. We establish connections between probabilistic kernels and feature space kernels through a geometric interpretation of the previously proposed probability product kernel. A family of probabilistic kernels, based on information divergence measures, is then introduced and its connections to various existing probabilistic kernels are analyzed. The new family is shown to provide a unifying framework for the study of various important questions in kernel theory and practice. We exploit this property to design a set of experiments that yield interesting results regarding the role of properties such as linearity, positive definiteness, and the triangle inequality in kernel performance. ## Modified Maximum Margin Markov NetworksF. Perez-Cruz, Z. Ghahramani and M. Pontil Lately there has been great interest in combining graphical models and discriminative learning, mainly using maximum margin classifiers. There have been several attempts to relate SVMs and graphical models (e.g. Altun et al., 2003; Taskar et al., 2003). Taskar et al (2003) provided a reduction of the computational complexity (from exponentially to polynomial) of training a multiclass Support Vector Machine where the output classes are structured according to a Markov Network. In this talk we present a further reduction of the complexity of this maximum margin Markov (M^3) Network. We use an algebraic transformation to reduce the number of variables in the dual in the M^3 network. This reduction leads to a more comprehensible dual, which is formed by independent primal problems per clique. The use of this reduction mechanism also allows a better understanding of how the variables interact in the primal space. We show an application of this modified maximum margin Markov network to solving the channel equalization problem, which is a major component in digital communications. Finally, we will address some issues on how the per-clique class predictions can be combined to obtain full predictions and we discuss extensions to probabilistic interpretations of the outputs. ## Large Margin Latent Graphical ModelsT. Jebara Graphical models such as Bayesian networks, distributions, and hidden Markov models are elegant formalisms to setup and specify prior knowledge about a learning problem. However, unlike kernel methods, the standard estimation methods graphical models rely on, including maximum likelihood and Bayesian integration do not focus modeling resources on a particular input-output task. They only generically describe the data. In applied settings when models are imperfectly matched to real data, more discriminative learning as in kernel or support vector machines is crucial for improving performance. In this talk, I show how we can learn generative models optimally for a given task such as classification and obtain large margin discrimination boundaries. Through maximum entropy discrimination, all exponential family models can be discriminative via convex programming. Furthermore, the method handles interesting latent graphical models such as mixtures and hidden Markov models. This is done via a variant of the maximum entropy that uses variational bounding (two applications of Jensen's inequality) on classification constraints to make computations tractable in the latent case. Interestingly, the method gives rise to Lagrange multipliers that behave like posteriors over hidden variables. As in the E-step of the Expectation-Maximization method, these Lagrange multiplier posteriors inherit the highly factorized structure of graphical models and are amenable to efficient algorithms. ## Comparison of Recent Kernel Methods for Graphical ModelsY. Altun Kernel methods for graphical models combine advantages of two important areas of machine learning: From graphical models, they inherit the property of exploiting correlations between multiple labels which constitute a structure.As kernel methods, they can also learn non-linear discriminant functions and thus overcome limitations of parametric methods which exhibit conceptual and computational difficulties in high-dimensional input spaces.Recent work on this hybrid field can be classified as generalizations of two kernel methods to structure learning: Maximum Margin Markov Networks TGK04 and Hidden Markov Support Vector Machines ATH03 as generalizations of SVMs and Kernel Conditional Random Fields LZL04 and Gaussian Process Sequence Classification AHS04 as generalizations of Gaussian Processes. These approaches as well as others such as perceptron for structured domains CD01 differ in their objective functions, parameterizations and optimization methods. In this talk, we investigate these differences and provide their theoretical and empirical comparisons. ## Semidefinite Relaxations for MAP Estimation in Exponential FamiliesA. J. Smola, R. I. Kondor, S V N Vishwanathan and T. Jebara We apply the semidefinite relaxations introduced by Wainwright and Jordan to Maximum a Posteriori (MAP) estimation in graphical models. In particular, we focus on intractable conditional random fields. The semidefinite relaxations are used to bound the negative log-posterior, providing a convex alternative to the conventional approximations used for MAP estimation. We then discuss the extensions of our method to handle non-parametric estimation in feature space using graphical models. Preliminary experimental results are also presented. ## Kernel Methods for Missing VariablesA. J. Smola, S V N Vishwanathan and T. Hofmann We present methods for dealing with missing variables in the context of Gaussian Processes and Support Vector Machines. This solves an important problem which has largely been ignored by kernel methods: How to systematically deal with incomplete data? Our method can also be applied to problems with partially observed labels as well as to the transductive setting where we view the labels as missing data. Our approach relies on casting kernel methods as an estimation problem in exponential families. Hence, estimation with missing variables becomes a problem of computing marginal distributions, and finding efficient optimization methods. ## Max-Margin Associative Markov nets for 3D visionBen Taskar Markov networks are extensively used to model complex sequential, spatial, and relational interactions. However, inference and learning in general Markov networks is intractable. We focus on learning a large subclass of such models (called associative Markov networks (AMNs) that is tractable or closely approximable. This subclass contains networks of discrete variables with K labels each and clique potentials that favor the same labels for all variables in the clique. Our approach exploits a linear programming relaxation for the task of finding the best joint assignment in such networks, which provides an approximate quadratic program (QP) for maximum margin estimation. We show that for associative Markov network over binary-valued variables, this approximate QP is guaranteed to return an optimal parameterization for Markov networks of arbitrary topology. For the non-binary case, optimality is not guaranteed, but the relaxation produces good solutions in practice. Our max-margin formulation allows efficient use of kernels to define potentials within the graphical model, producing sparse yet rich high-dimensional representations. We use AMNs to address the problem of segmenting 3D scan data into objects or object classes. We test our approach on three large-scale datasets produced by different kinds of 3D sensors, showing its applicability to both outdoor and indoor environments containing diverse objects. Joint work with Vassil Chatalbashev, Dragomir Anguelov, Daphne Koller, Dinkar Gupta, Geremy Heitz, Andrew Ng. |