## Classification with asymmetric label noise: Consistency and maximal denoising

*Gilles Blanchard et al.*

DOI: 10.1214/16-EJS1193

In many real-world classification problems, the labels of training examples are randomly corrupted. Most previous theoretical work on classification with label noise assumes that the two classes are separable, that the label noise is independent of the true class label, or that the noise proportions for each class are known. In this work, we give conditions that are necessary and sufficient for the true class-conditional distributions to be identifiable. These conditions are weaker than those analyzed previously, and allow for the classes to be nonseparable and the noise levels to be asymmetric and unknown. The conditions essentially state that a majority of the observed labels are correct and that the true class-conditional distributions are “mutually irreducible,” a concept we introduce that limits the similarity of the two distributions. For any label noise problem, there is a unique pair of true class-conditional distributions satisfying the proposed conditions, and we argue that this pair corresponds in a certain sense to maximal denoising of the observed distributions. Our results are facilitated by a connection to “mixture proportion estimation,” which is the problem of estimating the maximal proportion of one distribution that is present in another. We establish a novel rate of convergence result for mixture proportion estimation, and apply this to obtain consistency of a discrimination rule based on surrogate loss minimization. Experimental results on benchmark data and a nuclear particle classification problem demonstrate the efficacy of our approach.
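One common way to write down the setting described above is as a pair of mutually contaminated distributions. The block below is only a sketch: the symbols $P_0, P_1$ (true class-conditional distributions), $\tilde{P}_0, \tilde{P}_1$ (observed ones), $\pi_0, \pi_1$ (noise proportions) and $\nu^*$ (maximal mixture proportion) are notation assumed here for illustration and do not appear in the abstract.

```latex
% Observed (noisy) class-conditionals as mixtures of the true ones
\tilde{P}_0 = (1-\pi_0)\,P_0 + \pi_0\,P_1, \qquad
\tilde{P}_1 = (1-\pi_1)\,P_1 + \pi_1\,P_0, \qquad
\pi_0 + \pi_1 < 1 .

% Mixture proportion estimation: the maximal proportion of H present in F
\nu^*(F, H) = \max\bigl\{\alpha \in [0,1] :
  F = (1-\alpha)\,G + \alpha\,H \ \text{for some distribution } G \bigr\} .

% Mutual irreducibility, under this reading
\nu^*(P_0, P_1) = 0 \quad\text{and}\quad \nu^*(P_1, P_0) = 0 .
```

Read this way, the constraint $\pi_0 + \pi_1 < 1$ plays the role of the "majority of the observed labels are correct" condition, and mutual irreducibility says that neither true distribution contains a nonzero proportion of the other.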

## Non-linear learning in the presence of label noise

*Ata Kabán*

The classical machinery of supervised learning relies on a correct set of training labels. Unfortunately, there is no guarantee that all of the labels are correct. Labelling errors are increasingly noticeable in today’s classification tasks, as the scale and difficulty of these tasks increase so much that perfect label assignment becomes nearly impossible. Several algorithms have been proposed to alleviate the problem. For non-linear classification, the flexibility of the classifier together with the uncertainty about the labels makes the problem ill-posed in general.

This talk addresses two approaches for devising label-robust classifiers through probabilistic modelling of label corruption. The first approach makes use of a widely used kernelising technique to devise a label-robust Kernel Logistic Regression classifier. To determine the model complexity parameters when no trusted validation set is available, we adapted a Multiple Kernel Learning approach for this new purpose, together with a Bayesian regularisation scheme. Extensive empirical results on 13 benchmark data sets, as well as two real-world applications, demonstrate the success of this approach. The second approach targets boosting ensembles. Boosting is known to be particularly sensitive to label noise. To robustify it, we may employ label-noise robust base learners, or we may modify the AdaBoost algorithm itself. Empirical tests suggest that a committee of robust classifiers is still susceptible to label noise, but that combining both options results in an algorithm that is more resilient to mislabelling.
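To make "probabilistic modelling of label corruption" concrete, the following is a minimal sketch of one standard formulation: the observed label is assumed to be generated from the true label through a flip-probability matrix, and the classifier is fitted by maximising the likelihood of the observed labels. The function names and the 2x2 `flip` matrix are illustrative assumptions, and the sketch takes raw classifier scores as input rather than implementing the kernelised model from the talk.

```python
import numpy as np

def noisy_label_posterior(scores, flip):
    """Probability of each OBSERVED label given real-valued scores f(x).

    flip[j, k] is the assumed probability that true class j is
    recorded as class k (each row sums to 1).
    """
    scores = np.asarray(scores, dtype=float)
    p_true_1 = 1.0 / (1.0 + np.exp(-scores))          # logistic model for the true label
    clean = np.stack([1.0 - p_true_1, p_true_1], axis=1)
    return clean @ np.asarray(flip, dtype=float)       # marginalise over the true label

def noisy_negative_log_likelihood(scores, observed, flip):
    """Average negative log-likelihood of the observed (possibly wrong) labels."""
    post = noisy_label_posterior(scores, flip)
    observed = np.asarray(observed, dtype=int)
    return -np.mean(np.log(post[np.arange(observed.size), observed]))

# Example: 10% of class-0 and 30% of class-1 labels are assumed flipped.
flip = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
scores = np.array([-2.0, 0.5, 3.0])
observed = np.array([0, 1, 0])
loss = noisy_negative_log_likelihood(scores, observed, flip)
```

With an identity `flip` matrix this reduces to the ordinary logistic loss. In a full method, the flip probabilities would typically be estimated jointly with the classifier parameters rather than fixed in advance.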

**References**

- J. Bootkrajang, A. Kaban. Learning kernel logistic regression in the presence of class label noise. Pattern Recognition 47 (11), 3641-3655, 2014.
- J. Bootkrajang, A. Kaban. Boosting in the presence of label noise. UAI 2013.

## Searching for the Optimal Data Preprocessing Strategy

*Laure Berti-Equille*

Data science experts estimate that as much as 40 to 50% of a project budget might be spent correcting data errors in time-consuming, labor-intensive and tedious processes. Data mining analysts spend up to 80% of their time preparing the data rather than doing actual quantitative analysis or predictive modeling. Many applications of data analytics and knowledge discovery from data require various forms of data preprocessing and repairing because the data (the input to the algorithms) do not necessarily conform to “nice” data distributions and contain outlying, missing, inconsistent, duplicate, and incorrect values. This leaves a large gap between the available “dirty” data and the available machinery to process the data for decisional purposes. This talk will review data quality problems and related data preprocessing solutions, with a particular emphasis on the impact of selecting particular data preprocessing strategies, as this choice may dramatically change the final output results and the conclusions of the analysis. How can we ensure that the selected data preprocessing strategy is adequate and “optimal” to some extent? This talk will attempt to answer this question with a principled approach and give preliminary elements of a solution to this problem.

## Supervised classification of satellite image time series from noisy labeled data

*Charlotte Pelletier et al.*

Supervised classification systems used for land cover mapping require large amounts of accurate training data. These reference data generally come from different sources such as field measurements, thematic maps, or aerial photos. Due to misregistration, update delay, or land cover complexity, they may contain class label noise, i.e. wrong label assignments, which degrades classification performance.

Firstly, this work aims at evaluating the impact of mislabeled training data on classification performance. In particular, it addresses the random class label noise problem for the classification of high resolution satellite image time series. Experiments are carried out on synthetic and real datasets with two traditional classifiers: Support Vector Machines (SVM) and Random Forests (RF). The synthetic dataset has been specifically designed for this study, simulating vegetation index profiles over one year.
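A random class label noise experiment of this kind needs a way to corrupt a controlled fraction of the training labels. The sketch below shows one simple way to do that, assuming uniform noise: a fixed proportion of samples is chosen at random and each is reassigned to a different class picked uniformly. The function name and parameters are hypothetical, and the abstract does not specify the exact noise protocol used in the study.

```python
import numpy as np

def inject_random_label_noise(labels, noise_rate, seed=0):
    """Return a noisy copy of `labels` and the indices that were flipped.

    Assumption: noise is uniform, i.e. every sample is equally likely to
    be corrupted and the wrong label is drawn uniformly from the other classes.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    noisy = labels.copy()

    n_flip = int(round(noise_rate * labels.size))
    flip_idx = rng.choice(labels.size, size=n_flip, replace=False)
    for i in flip_idx:
        candidates = classes[classes != labels[i]]   # any class except the true one
        noisy[i] = rng.choice(candidates)
    return noisy, flip_idx

# Example: corrupt 20% of the labels of a balanced 4-class training set.
clean = np.repeat(np.arange(4), 50)
noisy, flipped = inject_random_label_noise(clean, noise_rate=0.2, seed=1)
```

Training the same classifier on `clean` and on `noisy` at several noise rates, and comparing accuracy on an uncorrupted test set, gives the kind of degradation curve such a study would report.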

Secondly, a performance evaluation of outlier detection methods is presented for our mislabeled data identification problem. The goal is to show how such methods can be used in the classification framework as a filtering step that accounts for the mislabeled data. Different iterative filtering strategies will also be presented and evaluated.
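As an illustration of what a filtering step can look like, here is a minimal sketch that flags a training sample as suspicious when most of its nearest neighbours carry a different label. This nearest-neighbour disagreement rule is a simple stand-in chosen for brevity; it is not one of the outlier detection methods evaluated in the talk, and the function name and the threshold of one half are assumptions.

```python
import numpy as np

def flag_suspicious_labels(X, y, k=5):
    """Boolean mask: True where fewer than half of the k nearest
    neighbours (Euclidean distance) share the sample's label."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise distances; fine for small sets, O(n^2) in memory.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    agreement = (y[neighbours] == y[:, None]).mean(axis=1)
    return agreement < 0.5

# Example: two well-separated clusters, with one label deliberately wrong.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(10, 2)),
               rng.normal(10.0, 0.5, size=(10, 2))])
y = np.array([0] * 10 + [1] * 10)
y[0] = 1                                            # mislabeled sample
suspicious = flag_suspicious_labels(X, y, k=5)
X_filtered, y_filtered = X[~suspicious], y[~suspicious]
```

An iterative variant would repeat the flag-and-remove step on the filtered set until no further samples are flagged.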

## Tell me What Label Noise is and I will Tell you how to Dispose of it

*Benoît Frénay*

Label noise is an important issue in classification, with many potential negative consequences, and many works in the literature have been devoted to its study. This talk will give a broad overview of the different types of label noise, their consequences and the algorithms to deal with them. Then, two generic approaches will be discussed for designing new label noise-tolerant machine learning techniques. Several use cases will be described, including ECG segmentation, feature selection, data stream handling and maximum likelihood methods.

## Counting Votes and the Attempt to Replicate Human Interpretation

*Daniel P. Lopresti*

After an initial rush to adopt purely electronic voting machines (so-called DREs), a number of locations in the United States are returning to the use of optical scan ballots as a more trustworthy way of recording votes. While presenting a number of significant advantages, voting on paper presents some interesting technical challenges as well. These largely arise from the complexities of attempting to build machine vision systems to replicate human interpretation of markings that can be ambiguous. What seems like a relatively easy pattern recognition task can serve as the basis for a rich line of research questions. I will highlight these issues in my talk, and also describe a unique collection of real ballot images from a US election a number of years ago that can serve as a useful test case.