AI & Statistics 2016

AISTATS 2016 Invited Speakers

Kamalika Chaudhuri, University of California, San Diego

Challenges in Privacy-Preserving Data Analysis

Abstract Machine learning algorithms increasingly work with sensitive information about individuals, and hence the problem of privacy-preserving data analysis -- how to design data analysis algorithms that operate on the sensitive data of individuals while still guaranteeing the privacy of individuals in the data -- has acquired great practical importance. In this talk, we address two problems in privacy-preserving data analysis.

First, we address the problem of privacy-preserving classification, and present an efficient classifier which is private in the differential privacy model of Dwork et al. Our classifier works in the ERM (empirical risk minimization) framework, and includes privacy-preserving logistic regression and privacy-preserving support vector machines.
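As a rough illustration of what differentially private ERM can look like, the sketch below uses output perturbation, one standard mechanism from the private ERM literature: train L2-regularized logistic regression, then add noise calibrated to the sensitivity of the minimizer. The abstract does not say which mechanism the talk presents, so this is an assumption, as are all function and parameter names (`train_logreg`, `private_logreg`, `lam`, `epsilon`). The privacy argument also assumes every feature vector has norm at most 1 and that the optimizer reaches the exact minimizer, which plain gradient descent only approximates.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logreg(X, y, lam, steps=2000, lr=0.5):
    """Gradient descent on (1/n) sum log(1 + exp(-y_i w.x_i)) + (lam/2)||w||^2.

    Labels y_i are in {-1, +1}.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]  # gradient of the regularizer
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            coef = -yi * sigmoid(-margin) / n  # derivative of the logistic loss
            for j in range(d):
                grad[j] += coef * xi[j]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

def sample_noise(d, scale, rng):
    """Draw b in R^d with density proportional to exp(-||b|| / scale)."""
    radius = rng.gammavariate(d, scale)  # norm ~ Gamma(shape=d, scale=scale)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]  # uniform direction
    length = math.sqrt(sum(v * v for v in direction))
    return [radius * v / length for v in direction]

def private_logreg(X, y, lam, epsilon, rng):
    """Output perturbation: non-private minimizer plus calibrated noise."""
    n, d = len(X), len(X[0])
    # Assumption needed for the sensitivity bound 2 / (n * lam).
    assert all(sum(v * v for v in xi) <= 1.0 + 1e-12 for xi in X)
    w = train_logreg(X, y, lam)
    b = sample_noise(d, 2.0 / (n * lam * epsilon), rng)
    return [wj + bj for wj, bj in zip(w, b)]

if __name__ == "__main__":
    rng = random.Random(0)
    X = [[0.5, 0.2], [0.4, 0.1], [-0.5, -0.3], [-0.2, -0.6]]
    y = [1, 1, -1, -1]
    print(private_logreg(X, y, lam=0.1, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` give stronger privacy but larger noise; the noise scale also shrinks as the number of training points or the regularization strength grows.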

We next address the question of learning from sensitive correlated data, such as private information on users connected together in a social network, and measurements of physical activity of a single user across time. Unfortunately, differential privacy cannot adequately address privacy challenges in this kind of data, and as such, these challenges have been largely ignored by the existing literature. We consider a recent generalization of differential privacy, called Pufferfish, that can be used to address privacy in correlated data, and present new privacy mechanisms in this framework. Based on joint work with Claire Monteleoni (George Washington University), Anand Sarwate (Rutgers), Yizhen Wang (UCSD) and Shuang Song (UCSD).

Bio Kamalika Chaudhuri is an Assistant Professor in the Computer Science and Engineering Department at UC San Diego. Prior to joining the department, she received a PhD in Computer Science from UC Berkeley in 2007, and was a postdoctoral researcher at UC San Diego from 2007 to 2010. She is the recipient of a Hellman Faculty Fellowship and received an NSF CAREER award in 2012. Kamalika's research is on learning theory, which deals with the theoretical foundations of machine learning. She is particularly interested in privacy-preserving machine learning -- how to learn a good predictor from sensitive training data, while ensuring the privacy of individuals in the data set.

Adam Tauman Kalai, Microsoft Research

Crowdsourced Data Representations

Abstract People understand many domains more deeply than today's machine learning systems. Having a good representation for a problem is crucial to the success of intelligent systems. In this talk, we discuss recent work and future opportunities for how humans can aid machine learning algorithms. Beyond simply labeling data, the crowd can help uncover the latent representation behind a problem. We discuss recent work on eliciting features using active learning as well as other aspects of crowdsourcing and machine learning, such as how crowdsourcing can help generate data, raise questions, and assist in more complex AI tasks.

Bio Adam Tauman Kalai received his BA (1996) from Harvard, and his MA (1998) and PhD (2001) from CMU under the supervision of Avrim Blum. After an NSF postdoctoral fellowship at M.I.T. with Santosh Vempala, he served as an assistant professor at the Toyota Technological Institute at Chicago and then at Georgia Tech. He is now a Principal Researcher at Microsoft Research New England. His honors include an NSF CAREER award and an Alfred P. Sloan fellowship. His research focuses on human computation, machine learning, and algorithms.

Richard Samworth, University of Cambridge

Random Projection Ensemble Classification

Abstract We introduce a very general method for high-dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower-dimensional space. In one special case that we study in detail, the random projections are divided into non-overlapping blocks, and within each block we select the projection yielding the smallest estimate of the test error. Our random projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random projection ensemble classifier can be controlled by terms that do not depend on the original data dimension. The classifier is also compared empirically with several other popular high-dimensional classifiers via an extensive simulation study, which reveals its excellent finite-sample performance.
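The procedure described in this abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions that the abstract does not fix: the base classifier is 1-nearest-neighbour, the test error of each projection is estimated by leave-one-out, the projections are plain Gaussian matrices, and the voting threshold is chosen by minimizing the training error of the ensemble. All names (`fit_rp_ensemble`, `B1`, `B2`, `alpha`) are hypothetical.

```python
import random

def project(A, x):
    """Apply the d x p matrix A (a list of rows) to the p-vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def one_nn(Z, y, z, skip=None):
    """Base classifier: label (0 or 1) of the nearest projected training point."""
    best, label = float("inf"), None
    for i, (zi, yi) in enumerate(zip(Z, y)):
        if i == skip:  # used for leave-one-out predictions
            continue
        dist = sum((a - b) ** 2 for a, b in zip(zi, z))
        if dist < best:
            best, label = dist, yi
    return label

def loo_error(Z, y):
    """Leave-one-out estimate of the test error of 1-NN on projected data."""
    wrong = sum(one_nn(Z, y, Z[i], skip=i) != y[i] for i in range(len(Z)))
    return wrong / len(Z)

def fit_rp_ensemble(X, y, d, B1, B2, rng):
    p = len(X[0])
    selected = []
    for _ in range(B1):  # B1 non-overlapping blocks of projections
        best_err, best_A = None, None
        for _ in range(B2):  # B2 random projections within each block
            A = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(d)]
            err = loo_error([project(A, x) for x in X], y)
            if best_err is None or err < best_err:
                best_err, best_A = err, A
        selected.append(best_A)  # keep the best projection of the block
    Zs = [[project(A, x) for x in X] for A in selected]
    # Fraction of selected projections voting for class 1 on each training
    # point, using leave-one-out predictions.
    votes = [sum(one_nn(Z, y, Z[i], skip=i) for Z in Zs) / B1
             for i in range(len(X))]
    # Data-driven voting threshold: the candidate with the fewest training errors.
    alpha = min(sorted(set(votes)),
                key=lambda a: sum((v >= a) != bool(t) for v, t in zip(votes, y)))
    return selected, Zs, list(y), alpha

def predict(model, x):
    selected, Zs, y, alpha = model
    vote = sum(one_nn(Z, y, project(A, x))
               for A, Z in zip(selected, Zs)) / len(selected)
    return 1 if vote >= alpha else 0

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: 20 features, classes differ by a mean shift in every coordinate.
    y = [c for c in (0, 1) for _ in range(25)]
    X = [[rng.gauss(2.0 * c, 1.0) for _ in range(20)] for c in y]
    model = fit_rp_ensemble(X, y, d=2, B1=10, B2=5, rng=rng)
    print(predict(model, [2.0] * 20), predict(model, [0.0] * 20))
```

Selecting the best projection within each block is what separates this from a plain average over random projections: most random projections of high-dimensional data carry little class signal, so the selection step filters them out before voting.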

Bio Richard Samworth is Professor of Statistics in the Statistical Laboratory at the University of Cambridge, and currently holds a GBP 1.2M Engineering and Physical Sciences Research Council Early Career Fellowship. He received his PhD in Statistics, also from the University of Cambridge, in 2004. Richard's main research interests are in nonparametric and high-dimensional statistical inference. Particular research topics include shape-constrained density and other nonparametric function estimation problems, nonparametric classification, clustering and regression, Independent Component Analysis, bagging and high-dimensional variable selection problems. Richard was awarded the Royal Statistical Society (RSS) Research Prize (2008), the RSS Guy Medal in Bronze (2012) and a Philip Leverhulme Prize (2014). He has been elected a Fellow of the Institute of Mathematical Statistics (2014) and the American Statistical Association (2015).