# FODS '20: Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference

# FODS '20: Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference

Full Citation in the ACM Digital Library## SESSION: Keynote Talk I

### Session details: Keynote Talk I

### AutoML and Interpretability: Powering the Machine Learning Revolution in Healthcare

An AutoML and interpretability are both fundamental to the successful uptake of machine learning by non-expert end users. The former will lower barriers to entry and unlock potent new capabilities that are out of reach when working with ad-hoc models, while the latter will ensure that outputs are transparent, trustworthy, and meaningful. In healthcare, AutoML and interpretability are already beginning to empower the clinical community by enabling the crafting of actionable analytics that can inform and improve decision-making by clinicians, administrators, researchers, policymakers, and beyond.

This keynote presents state-of-the-art AutoML and interpretability methods for healthcare developed in our lab and how they have been applied in various clinical settings (including cancer, cardiovascular disease, cystic fibrosis, and recently Covid-19), and then explains how these approaches form part of a broader vision for the future of machine learning in healthcare.

## SESSION: Session 1: Methodology

### Session details: Session 1: Methodology

### ADAGES: Adaptive Aggregation with Stability for Distributed Feature Selection

In this era of big data, not only the large amount of data keeps motivating distributed computing, but concerns on data privacy also put forward the emphasis on distributed learning. To conduct feature selection and to control the false discovery rate in a distributed pattern with multi-machines or multi-institutions, an efficient aggregation method is necessary. In this paper, we propose an adaptive aggregation method called ADAGES which can be flexibly applied to any machine-wise feature selection method. We will show that our method is capable of controlling the overall FDR with a theoretical foundation while maintaining power as good as the Union aggregation rule in practice.

### Classification Acceleration via Merging Decision Trees

We study the problem of merging decision trees: Given k decision trees $T_1,T_2,T_3...,T_k$, we merge these trees into one super tree T with (often) much smaller size. The resultant super tree T, which is an integration of k decision trees with each leaf having a major label, can also be considered as a (lossless) compression of a random forest. For any testing instance, it is guaranteed that the tree T gives the same prediction as the random forest consisting of $T_1,T_2,T_3...,T_k$ but it saves the computational effort needed for traversing multiple trees. The proposed method is suitable for classification problems with time constraints, for example, the online classification task such that it needs to predict a label for a new instance before the next instance arrives. Experiments on five datasets confirm that the super tree T runs significantly faster than the random forest with k trees. The merging procedure also saves space needed storing those k trees, and it makes the forest model more interpretable, since naturally one tree is easier to be interpreted than k trees.

### Tree Space Prototypes: Another Look at Making Tree Ensembles Interpretable

Ensembles of decision trees perform well on many problems, but are not interpretable. In contrast to existing approaches in interpretability that focus on explaining relationships between features and predictions, we propose an alternative approach to interpret tree ensemble classifiers by surfacing representative points for each class -- prototypes. We introduce a new distance for Gradient Boosted Tree models, and propose new, adaptive prototype selection methods with theoretical guarantees, with the flexibility to choose a different number of prototypes in each class. We demonstrate our methods on random forests and gradient boosted trees, showing that the prototypes can perform as well as or even better than the original tree ensemble when used as a nearest-prototype classifier. In a user study, humans were better at predicting the output of a tree ensemble classifier when using prototypes than when using Shapley values, a popular feature attribution method. Hence, prototypes present a viable alternative to feature-based explanations for tree ensembles.

### Ensembles of Bagged TAO Trees Consistently Improve over Random Forests, AdaBoost and Gradient Boosting

Ensemble methods based on trees, such as Random Forests, AdaBoost and gradient boosting, are widely recognized as among the best off-the-shelf classifiers: they typically achieve state-of-the-art accuracy in many problems with little effort in tuning hyperparameters, and they are often used in applications, possibly combined with other methods such as neural nets. While many variations of forest methods exist, using different diversity mechanisms (such as bagging, feature sampling or boosting), nearly all rely on training individual trees in a highly suboptimal way using greedy top-down tree induction algorithms such as CART or C5.0. We study forests where each tree is trained on a bootstrapped or random sample but using the recently proposed tree alternating optimization (TAO), which is able to learn trees that have both fewer nodes and lower error. The better optimization of individual trees translates into forests that achieve higher accuracy but using fewer, smaller trees with oblique nodes. We demonstrate this in a range of datasets and with a careful study of the complementary effect of optimization and diversity in the construction of the forest. These bagged TAO trees improve consistently and by a considerable margin over Random Forests, AdaBoost, gradient boosting and other forest algorithms in every single dataset we tried.

## SESSION: Session 2: Fairness, Privacy, Interpretability

### Session details: Session 2: Fairness, Privacy, Interpretability

### Interpreting Black Box Models via Hypothesis Testing

In science and medicine, model interpretations may be reported as discoveries of natural phenomena or used to guide patient treatments. In such high-stakes tasks, false discoveries may lead investigators astray. These applications would therefore benefit from control over the finite-sample error rate of interpretations. We reframe black box model interpretability as a multiple hypothesis testing problem. The task is to discover "important" features by testing whether the model prediction is significantly different from what would be expected if the features were replaced with uninformative counterfactuals. We propose two testing methods: one that provably controls the false discovery rate but which is not yet feasible for large-scale applications, and an approximate testing method which can be applied to real-world data sets. In simulation, both tests have high power relative to existing interpretability methods. When applied to state-of-the-art vision and language models, the framework selects features that intuitively explain model predictions. The resulting explanations have the additional advantage that they are themselves easy to interpret.

### Congenial Differential Privacy under Mandated Disclosure

Differentially private data releases are often required to satisfy a set of external constraints that reflect the legal, ethical, and logical mandates to which the data curator is obligated. The enforcement of constraints, when treated as post-processing, adds an extra phase in the production of privatized data. It is well understood in the theory of multi-phase processing that congeniality, a form of procedural compatibility between phases, is a prerequisite for the end users to straightforwardly obtain statistically valid results. Congenial differential privacy is theoretically principled, which facilitates transparency and intelligibility of the mechanism that would otherwise be undermined by ad-hoc post-processing procedures. We advocate for the systematic integration of mandated disclosure into the design of the privacy mechanism via standard probabilistic conditioning on the invariant margins. Conditioning automatically renders congeniality because any extra post-processing phase becomes unnecessary. We provide both initial theoretical guarantees and a Markov chain algorithm for our proposal. We also discuss intriguing theoretical issues that arise in comparing congenital differential privacy and optimization-based post-processing, as well as directions for further research.

### Incentives Needed for Low-Cost Fair Lateral Data Reuse

A central goal of algorithmic fairness is to build systems with fairness properties that compose gracefully. A major effort and step towards this goal in data science has been the development offair representations which guarantee demographic parity under sequential composition by imposing ademographic secrecy constraint. In this work, we elucidate limitations of demographically secret fair representations and propose a fresh approach to potentially overcome them by incorporating information about parties' incentives into fairness interventions. Specifically, we show that in a stylized model, it is possible to relax demographic secrecy to obtainincentive-compatible representations, where rational parties obtain exponentially greater utilities vis-à-vis any demographically secret representation and satisfy demographic parity. These substantial gains are recovered not from the well-knowncost of fairness, but rather from acost of demographic secrecy which we formalize and quantify for the first time. We further show that the sequential composition property of demographically secret representations is not robust to aggregation. Our results open several new directions for research in fair composition, fair machine learning and algorithmic fairness.

### Applying Algorithmic Accountability Frameworks with Domain-specific Codes of Ethics: A Case Study in Ecosystem Forecasting for Shellfish Toxicity in the Gulf of Maine

Ecological forecasts are used to inform decisions that can havesignificant impacts on the lives of individuals and on the healthof ecosystems. These forecasts, or models, embody the ethics oftheir creators as well as many seemingly arbitrary implementationchoices made along the way. They can contain implementationerrors as well as reflect patterns of bias learned when ingestingdatasets derived from past biased decision making. Principles andframeworks for algorithmic accountability allow a wide range ofstakeholders to place the results of models and software systemsinto context. We demonstrate how the combination of algorithmicaccountability frameworks and domain-specific codes of ethics helpanswer calls to uphold fairness and human values, specifically indomains that utilize machine learning algorithms. This helps avoidmany of the unintended consequences that can result from deploy-ing "black box" systems to solve complex problems. In this paper,we discuss our experience applying algorithmic accountability prin-ciples and frameworks to ecosystem forecasting, focusing on a casestudy forecasting shellfish toxicity in the Gulf of Maine. We adaptexisting frameworks such as Datasheets for Datasets and ModelCards for Model Reporting from their original focus on personallyidentifiable private data to include public datasets, such as thoseoften used in ecosystem forecasting applications, to audit the casestudy. We show how high level algorithmic accountability frame-works and domain level codes of ethics compliment each other,incentivizing more transparency, accountability, and fairness inautomated decision-making systems.

## SESSION: Keynote Talk II

### Session details: Keynote Talk II

### Semantic Scholar, NLP, and the Fight against COVID-19

This talk will describe the dramatic creation of the COVID-19 Open Research Dataset (CORD-19) at the Allen Institute for AI and the broad range of efforts, both inside and outside of the Semantic Scholar project, to garner insights into COVID-19 and its treatment based on this data. The talk will highlight the difficult problems facing the emerging field of Scientific Language Processing.

## SESSION: Session 3: Data Science Theory

### Session details: Session 3: Data Science Theory

### Non-Uniform Sampling of Fixed Margin Binary Matrices

Data sets in the form of binary matrices are ubiquitous across scientific domains, and researchers are often interested in identifying and quantifying noteworthy structure. One approach is to compare the observed data to that which might be obtained under a null model. Here we consider sampling from the space of binary matrices which satisfy a set of marginal row and column sums. Whereas existing sampling methods have focused on uniform sampling from this space, we introduce modified versions of two elementwise swapping algorithms which sample according to a non-uniform probability distribution defined by a weight matrix, which gives the relative probability of a one for each entry. We demonstrate that values of zero in the weight matrix, i.e. structural zeros, are generally problematic for swapping algorithms, except when they have special monotonic structure. We explore the properties of our algorithms through simulation studies, and illustrate the potential impact of employing a non-uniform null model using a classic bird habitation dataset.

### Large Very Dense Subgraphs in a Stream of Edges

We study the detection and the reconstruction of a large very dense subgraph in a social graph with n nodes and m edges given as a stream of edges, when the graph follows a power law degree distribution, in the regime when $m=O(n. łog n)$. A subgraph is very dense if its edge density is comparable to a clique. We uniformly sample the edges with a Reservoir of size $k=O(\sqrtn.łog n)$. The detection algorithm of a large very dense subgraph checks whether the Reservoir has a giant component. We show that if the graph contains a very dense subgraph of size $Ømega(\sqrtn )$, then the detection algorithm is almost surely correct. On the other hand, a random graph that follows a power law degree distribution almost surely has no large very dense subgraph, and the detection algorithm is almost surely correct. We define a new model of random graphs which follow a power law degree distribution and have large very dense subgraphs. We then show that on this class of random graphs we can reconstruct a good approximation of the very dense subgraph with high probability. We generalize these results to dynamic graphs defined by sliding windows in a stream of edges.

### Toward Communication Efficient Adaptive Gradient Method

In recent years, distributed optimization is proven to be an effective approach to accelerate training of large scale machine learning models such as deep neural networks. With the increasing computation power of GPUs, the bottleneck of training speed in distributed training is gradually shifting from computation to communication. Meanwhile, in the hope of training machine learning models on mobile devices, a new distributed training paradigm called "federated learning'' has become popular. The communication time in federated learning is especially important due to the low bandwidth of mobile devices. While various approaches to improve the communication efficiency have been proposed for federated learning, most of them are designed with SGD as the prototype training algorithm. While adaptive gradient methods have been proven effective for training neural nets, the study of adaptive gradient methods in federated learning is scarce. In this paper, we propose an adaptive gradient method that can guarantee both the convergence and the communication efficiency for federated learning.

### Towards Practical Lipschitz Bandits

Stochastic Lipschitz bandit algorithms balance exploration and exploitation, and have been used for a variety of important task domains. In this paper, we present a framework for Lipschitz bandit methods that adaptively learns partitions of context- and arm-space. Due to this flexibility, the algorithm is able to efficiently optimize rewards and minimize regret, by focusing on the portions of the space that are most relevant. In our analysis, we link tree-based methods to Gaussian processes. In light of our analysis, we design a novel hierarchical Bayesian model for Lipschitz bandit problems. Our experiments show that our algorithms can achieve state-of-the-art performance in challenging real-world tasks such as neural network hyperparameter tuning.

### On Reinforcement Learning for Turn-based Zero-sum Markov Games

We consider the problem of finding Nash equilibrium for two-player turn-based zero-sum games. Inspired by the AlphaGo Zero (AGZ) algorithm, we develop a Reinforcement Learning based approach. Specifically, we propose Explore-Improve-Supervise (EIS) method that combines "exploration", "policy improvement" and "supervised learning" to find the value function and policy associated with Nash equilibrium. We identify sufficient conditions for convergence and correctness for such an approach. For a concrete instance of EIS where random policy is used for "exploration", Monte-Carlo Tree Search is used for "policy improvement" and Nearest Neighbors is used for "supervised learning", we establish that this method finds an $\varepsilon$-approximate value function of Nash equilibrium in $\widetildeO(\varepsilon^-(d+4))$ steps when the underlying state-space of the game is continuous and d-dimensional. This is nearly optimal as we establish a lower bound of $\widetildeØmega (\varepsilon^-(d+2) )$ for any policy.

## SESSION: Session 4: Foundations in Practice

### Session details: Session 4: Foundations in Practice

### Transforming Probabilistic Programs for Model Checking

Probabilistic programming is perfectly suited to reliable and transparent data science, as it allows the user to specify their models in a high-level language without worrying about the complexities of how to fit the models. Static analysis of probabilistic programs presents even further opportunities for enabling a high-level style of programming, by automating time-consuming and error-prone tasks. We apply static analysis to probabilistic programs to automate large parts of two crucial model checking methods: Prior Predictive Checks and Simulation-Based Calibration. Our method transforms a probabilistic program specifying a density function into an efficient forward-sampling form. To achieve this transformation, we extract a factor graph from a probabilistic program using static analysis, generate a set of proposal directed acyclic graphs using a SAT solver, select a graph which will produce provably correct sampling code, then generate one or more sampling programs. We allow minimal user interaction to broaden the scope of application beyond what is possible with static analysis alone. We present an implementation targeting the popular Stan probabilistic programming language, automating large parts of a robust Bayesian workflow for a wide community of probabilistic programming users.

### StyleCAPTCHA: CAPTCHA Based on Stylized Images to Defend against Deep Networks

CAPTCHAs are widely deployed for bot detection. Many CAPTCHAs are based on visual perception tasks such as text and objection classification. However, they are under serious threat from advanced visual perception technologies based on deep convolutional networks (DCNs). We propose a novel CAPTCHA, called StyleCAPTCHA, that asks a user to classify stylized human versus animal face images. StyleCAPTCHA creates each stylized image by combining the content representations of a human or animal face image and the style representations of a reference image. Both the original face image and the style reference image are hidden from the user. To defend against attacks using DCNs, the StyleCAPTCHA service changes the style regularly. To adapt to the new styles, the attacker has to repeatedly train or retrain her DCNs, but since the attacker has insufficient training examples, she cannot train her DCNs well. We also propose Classifier Cross-task Transferability to measure the transferability of a classifier from its original task to another task. This metric allows us to arrange the schedule of styles and to limit the transferability of attackers' DCNs across classification tasks using different styles. Our evaluation shows that StyleCAPTCHA defends against state-of-the-art face detectors and against general DCN classifiers effectively.

### Statistical Significance in High-dimensional Linear Mixed Models

This paper develops an inferential framework for high-dimensional linear mixed effect models. Such models are suitable, e.g., when collecting n repeated measurements for M subjects. We consider a scenario where the number of fixed effects p is large (and may be larger than M), but the number of random effects q is small. Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators to perform inference for high-dimensional linear models with fixed effects only. In particular, we demonstrate how to correct a 'naive' ridge estimator to build asymptotically valid confidence intervals for mixed effect models. We validate our theoretical results with numerical experiments that show that our method can successfully account for the correlation induced by the random effects. For a practical demonstration we consider a riboflavin production dataset that exhibits group structure, and show that conclusions drawn using our method are consistent with those obtained on a similar dataset without group structure.

### Dynamical Gaussian Process Latent Variable Model for Representation Learning from Longitudinal Data

Many real-world applications involve longitudinal data, consisting of observations of several variables, where different subsets of variables are sampled at irregularly spaced time points. We introduce the Longitudinal Gaussian Process Latent Variable Model (L-GPLVM), a variant of the Gaussian Process Latent Variable Model, for learning compact representations of such data. L-GPLVM overcomes a key limitation of the Dynamic Gaussian Process Latent Variable Model and its variants, which rely on the assumption that the data are fully observed over all of the sampled time points. We describe an effective approach to learning the parameters of L-GPLVM from sparse observations, by coupling the dynamical model with a Multitask Gaussian Process model for sampling of the missing observations at each step of the gradient-based optimization of the variational lower bound. We further show the advantage of the Sparse Process Convolution framework to learn the latent representation of sparsely and irregularly sampled longitudinal data with minimal computational overhead relative to a standard Latent Variable Model. We demonstrated experiments with synthetic data as well as variants of MOCAP data with varying degrees of sparsity of observations that show that L-GPLVM substantially and consistently outperforms the state-of-the-art alternatives in recovering the missing observations even when the available data exhibits a high degree of sparsity. The compact representations of irregularly sampled and sparse longitudinal data can be used to perform a variety of machine learning tasks, including clustering, classification, and regression.