# AISTATS 2017 Program of Events

## Best Paper Awards

A Sub-Quadratic Exact Medoid Algorithm
James Newling, Francois Fleuret

Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation
Sohail Bahmani, Justin Romberg

Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms
Christian Naesseth, Francisco Ruiz, Scott Linderman, David Blei
## 19-Apr (Wed)

**18:30-20:30** Registration Desk

## 20-Apr (Thu)

**7:30-8:50** Breakfast, Windows on the Green & Chart Room

**8:00-10:00** Registration Desk

**8:50-9:00** Welcome and award announcement

**9:00-10:00** Invited Talk: Csaba Szepesvari. Crystal Ballroom 1, 2
**Stochastic linear bandits.**

*Learning and decision making often conflict: A decision maker who is uncertain about its environment may need to choose actions whose main benefit is to gain information rather than to gain reward. The dilemma between exploration and exploitation is at the heart of bandit problems. In this talk I will focus on the so-called stochastic linear bandits, where the payoff function is assumed to possess a linear structure, an assumption that has proved to be extremely effective elsewhere in machine learning. Here we look into how the linear structure can be exploited in bandits. I will discuss existing results and open questions. The issues discussed will include limits of performance of bandit algorithms, how to design efficient and effective algorithms, and how to exploit additional information like sparsity.*

__Bio__: Csaba Szepesvari is interested in designing principled learning algorithms for agents that learn from and control their environments. He is the author or coauthor of two books on the topic, as well as nearly 200 journal and conference papers. He is best known as the co-inventor of "UCT", a tree-search algorithm that inspired much subsequent work in AI, search, and optimization, and serves, among other things, as the basis of the search component in AlphaGo, DeepMind's Go program that defeated a top human Go player in the beginning of 2016. His work on UCT was recently recognized by the "Test of Time" award at ECML'2016. Csaba serves as an action editor of the Journal of Machine Learning Research and the Machine Learning Journal. He also served as a co-chair for COLT and ALT and has served as a senior PC member for first-tier AI and machine learning conferences for too many years. Currently Csaba is a Professor at the Department of Computing Science of the University of Alberta and a Principal Investigator of the Alberta Machine Intelligence Institute (AMII). From August, he will be joining DeepMind, London.
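The linear-bandit setting of this talk can be illustrated with a minimal LinUCB-style sketch. This is not material from the talk: the synthetic reward model, the exploration parameter `alpha`, and the function `linucb` are all illustrative assumptions.

```python
import numpy as np

def linucb(arms, theta_star, T=2000, alpha=1.0, noise=0.1, seed=0):
    """Minimal LinUCB sketch for stochastic linear bandits.

    arms: (K, d) array of arm feature vectors.
    theta_star: (d,) unknown parameter; reward = <arm, theta_star> + noise.
    Returns cumulative (pseudo-)regret against the best fixed arm.
    """
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    A = np.eye(d)              # regularized Gram matrix: I + sum_t x_t x_t^T
    b = np.zeros(d)            # sum_t r_t x_t
    means = arms @ theta_star
    regret = 0.0
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b  # ridge estimate of theta_star
        # Optimistic index: estimated mean + alpha * confidence width,
        # where the width is sqrt(x^T A^{-1} x) for each arm x.
        widths = np.sqrt(np.einsum("kd,dc,kc->k", arms, A_inv, arms))
        x = arms[int(np.argmax(arms @ theta_hat + alpha * widths))]
        r = x @ theta_star + noise * rng.standard_normal()
        A += np.outer(x, x)
        b += r * x
        regret += means.max() - x @ theta_star
    return regret
```

On a toy problem with orthogonal arms, the optimistic index quickly concentrates play on the best arm, so cumulative regret grows far more slowly than under uniform play; the linear structure is what lets one confidence ellipsoid over `theta_star` cover all arms at once.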

**10:00-10:30** Coffee Break, Crystal Atrium

**10:30-12:10** __Online Learning__, Crystal Ballroom 1, 2

*Session Chair: Csaba Szepesvari*

63 Linear Thompson Sampling Revisited

217 Horde of Bandits using Gaussian Markov Random Fields

225 The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

304 Improved Strongly Adaptive Online Learning using Coin Betting

**12:10-2:00** Lunch on your own

**1:00-3:00** Registration Desk

**2:00-3:40** __Nonparametric methods__, Crystal Ballroom 1, 2

*Session Chair: Byron Boots*

97 Poisson intensity estimation with reproducing kernels

249 Attributing Hacks

248 Regression Uncertainty on the Grassmannian

401 Modal-set estimation with an application to clustering

**3:40-4:10** Coffee break, Crystal Atrium

**4:10-7:00** Poster Session (with light snacks), Crystal Ballroom 3, 4

tP01: 352 Scalable variational inference for super resolution microscopy

tP02: 342 Complementary Sum Sampling for Likelihood Approximation in Large Scale Classification

tP03: 249 Attributing Hacks

tP04: 317 Fairness Constraints: Mechanisms for Fair Classification

tP05: 116 Encrypted accelerated least squares regression

tP06: 294 DP-EM: Differentially Private Expectation Maximization

tP07: 539 Greedy Direction Method of Multiplier for MAP Inference of Large Output Domain

tP08: 328 A Learning Theory of Ranking Aggregation

tP09: 406 Relativistic Monte Carlo

tP10: 119 Gray-box inference for structured Gaussian process models

tP11: 11 Clustering from Multiple Uncertain Experts

tP12: 139 Combinatorial Topic Models using Small-Variance Asymptotics

tP13: 212 High-dimensional Time Series Clustering via Cross-Predictability

tP14: 516 Convergence rate of stochastic k-means

tP15: 183 Fast column generation for atomic norm regularization

tP16: 492 Binary and Multi-Bit Coding for Stable Random Projections

tP17: 30 On the learnability of fully-connected neural networks

tP18: 362 Fast rates with high probability in exp-concave statistical learning

tP19: 526 Learning Graphical Games from Behavioral Data: Sufficient and Necessary Conditions

tP20: 441 Inference Compilation and Universal Probabilistic Programming

tP21: 199 Gradient Boosting on Stochastic Data Streams

tP22: 114 Localized Lasso for High-Dimensional Regression

tP23: 366 Learning with feature feedback: from theory to practice

tP24: 47 Consistent and Efficient Nonparametric Different-Feature Selection

tP25: 148 Rapid Mixing Swendsen-Wang Sampler for Stochastic Partitioned Attractive Models

tP26: 215 Learning Nonparametric Forest Graphical Models with Prior Information

tP27: 301 Efficient Algorithm for Sparse Tensor-variate Gaussian Graphical Models via Gradient Descent

tP28: 383 Belief Propagation in Conditional RBMs for Structured Prediction

tP29: 393 A Fast and Scalable Joint Estimator for Learning Multiple Related Sparse Gaussian Graphical Models

tP30: 422 Learning the Network Structure of Heterogeneous Data via Pairwise Exponential Markov Random Fields

tP31: 99 Generalized Pseudolikelihood Methods for Inverse Covariance Estimation

tP32: 327 A New Class of Private Chi-Square Hypothesis Tests

tP33: 32 An Information-Theoretic Route from Generalization in Expectation to Generalization in Probability

tP34: 405 Local Group Invariant Representations via Orbit Embeddings

tP35: 439 Spatial Decompositions for Large Scale SVMs

tP36: 488 Distributed Adaptive Sampling for Kernel Matrix Approximation

tP37: 97 Poisson intensity estimation with reproducing kernels

tP38: 268 Scalable Learning of Non-Decomposable Objectives

tP39: 414 Fast Classification with Binary Prototypes

tP40: 497 Label Filters for Large Scale Multilabel Classification

tP41: 150 Efficient Rank Aggregation via Lehmer Codes

tP42: 325 A Unified Computational and Statistical Framework for Nonconvex Low-rank Matrix Estimation

tP43: 45 Tensor-Dictionary Learning with Deep Kruskal-Factor Analysis

tP44: 469 Optimal Recovery of Tensor Slices

tP45: 290 Hit-and-Run for Sampling and Planning in Non-Convex Spaces

tP46: 312 Black-box Importance Sampling

tP47: 361 Sequential Graph Matching with Sequential Monte Carlo

tP48: 134 On the Troll-Trust Model for Edge Sign Prediction in Social Networks

tP49: 153 Nonlinear ICA of Temporally Dependent Stationary Sources

tP50: 216 Sparse Randomized Partition Trees for Nearest Neighbor Search

tP51: 341 Structured adaptive and random spinners for fast machine learning computations

tP52: 401 Modal-set estimation with an application to clustering

tP53: 435 Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning

tP54: 138 Online Optimization of Smoothed Piecewise Constant Functions

tP55: 178 Contextual Bandits with Latent Confounders: An NMF Approach

tP56: 200 Online Learning and Blackwell Approachability with Partial Monitoring: Optimal Convergence Rates

tP57: 217 Horde of Bandits using Gaussian Markov Random Fields

tP58: 221 Trading off Rewards and Errors in Multi-Armed Bandits

tP59: 225 The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits

tP60: 250 Unsupervised Sequential Sensor Acquisition

tP61: 304 Improved Strongly Adaptive Online Learning using Coin Betting

tP62: 35 Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection

tP63: 63 Linear Thompson Sampling Revisited

tP64: 96 Regret Bounds for Lifelong Learning

tP65: 106 Regret Bounds for Transfer Learning in Bayesian Optimisation

tP66: 111 Scaling Submodular Maximization via Pruned Submodularity Graphs

tP67: 23 Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach

tP68: 360 Linear Convergence of Stochastic Frank Wolfe Variants

tP69: 386 Finite-sum Composition Optimization via Variance Reduced Gradient Descent

tP70: 40 Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains

tP71: 467 Initialization and Coordinate Optimization for Multi-way Matching

tP72: 52 Less than a Single Pass: Stochastically Controlled Stochastic Gradient

tP73: 521 Scalable Convex Multiple Sequence Alignment via Entropy-Regularized Dual Decomposition

tP74: 193 Exploration-Exploitation in MDPs with Options

tP75: 515 Value-Aware Loss Function for Model-based Reinforcement Learning

tP76: 84 Learning Nash Equilibrium for General-Sum Markov Games from Batch Data

tP77: 324 Frequency Domain Predictive Modelling with Aggregated Data

tP78: 394 Communication-efficient Distributed Sparse Linear Discriminant Analysis

tP79: 402 Compressed Least Squares Regression revisited

tP80: 219 Random projection design for scalable implicit smoothing of randomly observed stochastic processes

tP81: 78 Learning Theory for Conditional Risk Minimization

tP82: 248 Regression Uncertainty on the Grassmannian

tP83: 299 Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems

tP84: 55 Learning Time Series Detection Models from Temporally Imprecise Labels

## 21-Apr (Fri)

**7:30-9:00** Breakfast, Windows on the Green & Chart Room

**8:00-10:00** Registration Desk

**9:00-10:00** Invited Talk: Cynthia Rudin. Crystal Ballroom 1, 2
**What Are We Afraid Of?: Computational Hardness vs the Holy Grail of Interpretability in Machine Learning.**

*Is there always a tradeoff between accuracy and interpretability? This is a very old AI question. Many people have claimed that they have investigated the answer to this question, but it is not clear that these attempts have been truly serious. If we try to investigate this claim by comparing interpretable modeling algorithms (like decision trees - say CART, C4.5) to a black box method that optimizes only accuracy (SVM or neural networks), we will not find the answer. This is not a fulfilling comparison - the methods for producing interpretable models are greedy myopic methods with no global objective, whereas the black box algorithms have global objectives and principled optimization routines. In order to actually answer this question, we would have to compare an "optimal" interpretable model to an optimal black box model. This means we actually need optimality for interpretable models. This, of course, leads to computational hardness, which scares us. On the other hand, we have computing power like never before. So do we truly know what we are afraid of any more?*

*In this talk I will discuss algorithms for interpretable machine learning. Some of these algorithms are designed to create certificates of nearness to optimality. I will focus on some of our most recent work, including (1) work on optimal rule list models using customized bounds and data structures (these are an alternative to CART), and (2) work on optimal scoring systems (alternatives to logistic regression + rounding). Further, since we have methods that can produce optimal or near-optimal models, we can use them to produce interesting new forms of interpretable models. These new forms were simply not possible before, since they are almost impossible to produce using traditional techniques (like greedy splitting and pruning). In particular: (3) Falling rule lists, (4) Causal falling rule lists, and (5) Cost-effective treatment regimes. Work on (1) is joint with postdoc Elaine Angelino, students Nicholas Larus-Stone and Daniel Alabi, and colleague Margo Seltzer. Work on (2) is joint with student Berk Ustun. Work on (3) and (4) is joint with students Fulton Wang and Chaofan Chen, and (5) is an AISTATS 2017 paper that is joint work with student Himabindu Lakkaraju.*

__Bio:__ Cynthia Rudin is an associate professor of computer science and electrical and computer engineering at Duke University, and directs the Prediction Analysis Lab. Her interests are in machine learning, data mining, applied statistics, and knowledge discovery (Big Data). Her application areas are in energy grid reliability, healthcare, and computational criminology. Previously, Prof. Rudin held positions at MIT, Columbia, and NYU. She holds an undergraduate degree from the University at Buffalo where she received the College of Arts and Sciences Outstanding Senior Award in Sciences and Mathematics, and three separate outstanding senior awards from the departments of physics, music, and mathematics. She received a PhD in applied and computational mathematics from Princeton University. She is the recipient of the 2013 and 2016 INFORMS Innovative Applications in Analytics Awards, an NSF CAREER award, was named as one of the "Top 40 Under 40" by Poets and Quants in 2015, and was named by Businessinsider.com as one of the 12 most impressive professors at MIT in 2015. Work from her lab has won 10 best paper awards in the last 5 years. Her work has been featured in Businessweek, The Wall Street Journal, the New York Times, the Boston Globe, the Times of London, Fox News (Fox & Friends), the Toronto Star, WIRED Science, U.S. News and World Report, Slashdot, CIO magazine, Boston Public Radio, and on the cover of IEEE Computer. She is past chair of the INFORMS Data Mining Section, and is currently chair-elect of the Statistical Learning and Data Science section of the American Statistical Association.
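As a concrete illustration of the model class behind falling rule lists: an ordered list of if-then rules whose predicted probabilities decrease down the list, so the first rule that fires both classifies and risk-ranks an example. The sketch below is purely hypothetical (the features, thresholds, probabilities, and the `predict` helper are invented for illustration, not taken from the cited papers).

```python
def predict(example, rules, default_p):
    """Return the probability attached to the first rule that fires.

    rules: list of (condition, probability) pairs, with probabilities
    in strictly decreasing order (the "falling" property).
    """
    for condition, p in rules:
        if condition(example):
            return p
    return default_p

# Hypothetical rules over invented features, in decreasing-probability order.
rules = [
    (lambda x: x["prior_events"] >= 3, 0.75),
    (lambda x: x["age"] < 25,          0.40),
    (lambda x: x["flagged"],           0.20),
]

print(predict({"prior_events": 4, "age": 30, "flagged": False}, rules, 0.05))  # 0.75
```

The monotone ordering is what makes the list readable as a risk stratification: examples matched higher up are, by construction, predicted to be higher risk than anything matched below.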

**10:00-10:30** Coffee Break, Crystal Atrium

**10:30-12:10** __Theory__, Crystal Ballroom 1, 2

*Session Chair: Sanjoy Dasgupta*

94 Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation

68 A Sub-Quadratic Exact Medoid Algorithm

456 On the Interpretability of Conditional Probability Estimates in the Agnostic Setting

209 Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers

**12:10-2:00** Lunch on your own

**1:00-3:00** Registration Desk

**2:00-3:40** __Approximate Inference and MCMC__, Crystal Ballroom 1, 2

*Session Chair: Simon Lacoste-Julien*

51 Annular Augmentation Sampling

101 Removing Phase Transitions from Gibbs Measures

170 Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms

174 Asymptotically exact inference in differentiable generative models

**3:40-4:10** Coffee Break, Crystal Atrium

**4:10-7:00** Poster Session (with light snacks), Crystal Ballroom 3, 4

fP01: 82 Near-optimal Bayesian Active Learning with Correlated and Noisy Tests

fP02: 9 Large-Scale Data-Dependent Kernel Approximation

fP03: 86 Distance Covariance Analysis

fP04: 228 Rank Aggregation and Prediction with Item Features

fP05: 420 Signal-based Bayesian Seismic Monitoring

fP06: 60 Learning Cost-Effective and Interpretable Treatment Regimes

fP07: 170 Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms

fP08: 174 Asymptotically exact inference in differentiable generative models

fP09: 288 Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models

fP10: 196 Local Perturb-and-MAP for Structured Prediction

fP11: 51 Annular Augmentation Sampling

fP12: 104 Performance Bounds for Graphical Record Linkage

fP13: 180 Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets

fP14: 273 CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC

fP15: 298 On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior

fP16: 345 Learning Optimal Interventions

fP17: 419 Learning Structured Weight Uncertainty in Bayesian Neural Networks

fP18: 429 Discovering and Exploiting Additive Structure for Bayesian Optimization

fP19: 211 Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes

fP20: 416 Prediction Performance After Learning in Gaussian Process Regression

fP21: 129 A Framework for Optimal Matching for Causal Inference

fP22: 523 Robust Causal Estimation in the Large-Sample Limit without Strict Faithfulness

fP23: 182 Least-Squares Log-Density Gradient Clustering for Riemannian Manifolds

fP24: 68 A Sub-Quadratic Exact Medoid Algorithm

fP25: 117 Random Consensus Robust PCA

fP26: 224 Adaptive ADMM with Spectral Penalty Parameter Selection

fP27: 94 Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation

fP28: 214 Data Driven Resource Allocation for Distributed Learning

fP29: 278 Comparison-Based Nearest Neighbor Search

fP30: 363 Generalization Error of Invariant Classifiers

fP31: 456 On the Interpretability of Conditional Probability Estimates in the Agnostic Setting

fP32: 141 ConvNets with Smooth Adaptive Activation Functions for Regression

fP33: 404 Diverse Neural Network Learns True Target Functions

fP34: 209 Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers

fP35: 329 Anomaly Detection in Extreme Regions via Empirical MV-sets on the Sphere

fP36: 69 Minimax density estimation for growing dimension

fP37: 540 Scalable Greedy Feature Selection via Weak Submodularity

fP38: 449 Information Projection and Approximate Inference for Structured Sparse Variables

fP39: 227 Dynamic Collaborative Filtering With Compound Poisson Factorization

fP40: 242 Information-theoretic limits of Bayesian network structure learning

fP41: 3 Conditions beyond treewidth for tightness of higher-order LP relaxations

fP42: 347 A Lower Bound on the Partition Function of Attractive Graphical Models in the Continuous Case

fP43: 531 Non-Count Symmetries in Boolean & Multi-Valued Prob. Graphical Models

fP44: 504 Sequential Multiple Hypothesis Testing with Type I Error Control

fP45: 22 Lower Bounds on Active Learning for Graphical Model Selection

fP46: 498 Learning from Conditional Distributions via Dual Embeddings

fP47: 13 Online Nonnegative Matrix Factorization with General Divergences

fP48: 161 Stochastic Difference of Convex Algorithm and its Application to Training Deep Boltzmann Machines

fP49: 384 Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

fP50: 417 Communication-Efficient Learning of Deep Networks from Decentralized Data

fP51: 520 Automated Inference with Adaptive Batches

fP52: 190 Bayesian Hybrid Matrix Factorisation for Data Integration

fP53: 192 Co-Occurring Directions Sketching for Approximate Matrix Multiply

fP54: 205 Tensor Decompositions via Two-Mode Higher-Order SVD (HOSVD)

fP55: 442 Active Positive Semidefinite Matrix Completion: Algorithms, Theory and Applications

fP56: 245 Markov Chain Truncation for Doubly-Intractable Inference

fP57: 101 Removing Phase Transitions from Gibbs Measures

fP58: 484 Distribution of Gaussian Process Arc Lengths

fP59: 76 Estimating Density Ridges by Direct Estimation of Density-Derivative-Ratios

fP60: 213 Minimax Approach to Variable Fidelity Data Interpolation

fP61: 132 Stochastic Rank-1 Bandits

fP62: 26 Sparse Accelerated Exponential Weights

fP63: 479 Efficient Online Multiclass Prediction on Graphs via Surrogate Losses

fP64: 124 Frank-Wolfe Algorithms for Saddle Point Problems

fP65: 167 Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot

fP66: 175 Decentralized Collaborative Learning of Personalized Models over Networks

fP67: 20 ASAGA: Asynchronous Parallel SAGA

fP68: 264 A Stochastic Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization

fP69: 282 A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe

fP70: 284 Faster Coordinate Descent via Adaptive Importance Sampling

fP71: 375 Tracking Objects with Higher Order Interactions via Delayed Column Generation

fP72: 399 Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

fP73: 367 Optimistic Planning for the Stochastic Knapsack Problem

fP74: 410 Thompson Sampling for Linear-Quadratic Control Problems

fP75: 239 Robust and Efficient Computation of Eigenvectors in a Generalized Spectral Method for Constrained Clustering

fP76: 302 Minimax-optimal semi-supervised regression on unknown manifolds

fP77: 2 Minimax Gaussian Classification & Clustering

fP78: 372 Identifying groups of strongly correlated variables through Smoothed Ordered Weighted L_1-norms

fP79: 494 Spectral Methods for Correlated Topic Models

fP80: 131 Quantifying the accuracy of approximate diffusions and Markov chains

fP81: 267 Hierarchically-partitioned Gaussian Process Approximation

fP82: 459 Linking Micro Event History to Macro Prediction in Point Process Models

fP83: 507 A Maximum Matching Algorithm for Basis Selection in Spectral Learning

**7:15-9:00** Dinner Buffet, Panorama Ballroom

## 22-Apr (Sat)

**7:30-9:00** Breakfast, Panorama Ballroom C, D & Terrace

**8:00-10:00** Registration Desk

**9:00-10:00** Invited Talk: Sanjoy Dasgupta. Panorama Ballroom A, B

**Towards a Theory of Interactive Learning.**

*"Interactive learning" refers to scenarios in which a learning agent (human or machine) engages with an information-bearing agent or system (for instance, a human expert) with the goal of efficiently arriving at a useful model. Examples include: active learning of classifiers; automated teaching systems; augmenting unsupervised learning with interactive post-editing; and so on. In particular, such interaction is a basic mechanism by which we can communicate our needs and preferences to the computers that play an increasing role in our lives.*

*It would be helpful to have unifying mathematical frameworks that can provide a basis for evaluating interactive schemes, and that supply generic interaction algorithms. I will describe one such mathematical framework that covers a fairly broad range of situations, and illustrate how it yields algorithms for interactive hierarchical clustering and interactive topic modeling.*

__Bio__: Sanjoy Dasgupta is a Professor in the Department of Computer Science and Engineering at UC San Diego. He received his PhD at UC Berkeley in 2000. He works on algorithms for machine learning, with a focus on unsupervised and interactive learning. He is the author of a textbook, 'Algorithms', with Christos Papadimitriou and Umesh Vazirani. He was program co-chair for the Conference on Learning Theory (COLT) in 2009 and for the International Conference on Machine Learning (ICML) in 2013.

**10:00-10:30** Coffee Break, Panorama Foyer

**10:30-12:10** __Bayesian Methods__, Panorama Ballroom A, B

*Session Chair: Rebecca Steorts*

420 Signal-based Bayesian Seismic Monitoring

180 Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets

82 Near-optimal Bayesian Active Learning with Correlated and Noisy Tests

298 On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior

**12:10-1:30** Lunch on your own **(note shorter lunch)**

**1:30-3:10** __Large-scale learning__, Panorama Ballroom A, B

*Session Chair: Pradeep Ravikumar*

417 Communication-Efficient Learning of Deep Networks from Decentralized Data

520 Automated Inference with Adaptive Batches

224 Adaptive ADMM with Spectral Penalty Parameter Selection

372 Identifying groups of strongly correlated variables through Smoothed Ordered Weighted L_1-norms

**3:10-3:40** Coffee Break, Panorama Foyer

**3:40-5:20** __Sketching__, Panorama Ballroom A, B

*Session Chair: Anastasios (Tasos) Kyrillidis*

384 Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

399 Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

192 Co-Occurring Directions Sketching for Approximate Matrix Multiply

117 Random Consensus Robust PCA