| Paper ID: | 3364 |
|---|---|
| Title: | L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise |

Overall, the paper is well organized and easy to follow. The problem of learning DNNs that are robust to label noise is interesting for both research and practical applications. The proposed robust loss function sounds reasonable and should be effective in some cases. For the weaknesses of the paper, see the suggested improvements below.

Label noise learning is a hot topic now that datasets are growing bigger and their labels noisier. Learning the classifier that is optimal w.r.t. the clean data from noisy data is challenging. Many robust learning methods have been proposed to guarantee this, but to the best of my knowledge they all need the transition matrix, which can itself be challenging to learn. This paper proposes the first loss function that is robust to instance-independent label noise without knowing the transition matrix, and thus makes a significant contribution to the community.

My main concern is the case where the latent true variable Y is not identifiable, e.g., when there exists another latent variable Y' that also has a transition relationship with the noisy label. How can the optimal classifier for Y be found then? This case generally arises when the transition matrix is not identifiable.

From the experiments, we find that the proposed method doesn't perform well when the noise rate is low. Intuitively, the proposed method should work well on clean data as well. I suggest that the authors add comparisons on clean data and explain why the method doesn't work well under small noise.

=== After seeing the rebuttal, my main concern hasn't been well addressed. It remains unclear whether the latent true label Y is identifiable under the proposed method. It seems that if there is a latent variable Y' that also has a constant transition matrix to the noisy labels, the proposed method cannot distinguish Y from Y'. The authors respond that the proposed method will learn something more informative about X. This is reasonable, but it also implies that if Y' is more informative about X, the proposed method will lead to wrong predictions. For example, when instances potentially belong to multiple classes and there are constant transition relationships among them, it may be very hard for the proposed method to distinguish them.

It seems necessary to show whether the proposed method works well on clean data. If it does not, the cause may be that the proposed method finds a Y' instead of Y. I worry that the authors avoid learning the transition matrix but introduce a harder problem of identifying Y.
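A narrow special case of this identifiability concern can be checked numerically. The sketch below is an illustration only, not the reviewer's example or the authors' analysis: it assumes the objective depends on the classifier only through the absolute determinant of the joint matrix between predictions and noisy labels, and the matrix `U` is randomly generated, hypothetical data. Under that assumption, consistently relabelling the classifier's outputs (a permuted Y' in place of Y) leaves the objective unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
C = 3

# Hypothetical joint distribution of (predicted class, noisy label); entries sum to 1.
U = rng.dirichlet(np.ones(C * C)).reshape(C, C)

# Permutation matrix that relabels the classifier's outputs (rows of U).
P = np.eye(C)[[1, 2, 0]]
U_relabelled = P @ U

# |det(P U)| = |det(P)| * |det(U)| = |det(U)|, since |det(P)| = 1 for any permutation.
abs_det_original = abs(np.linalg.det(U))
abs_det_relabelled = abs(np.linalg.det(U_relabelled))
```

The two absolute determinants agree even though the joint matrices differ, so a determinant-based objective alone would not separate these two labellings; some extra assumption on the noise would be needed to pick one.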

This paper proposes a new information-theoretic loss function, L_DMI, for training deep neural networks robust to noisy labels. Specifically, it first proposes a new information measure, DMI (Determinant based Mutual Information), which is a generalized version of mutual information. Based on the relative invariance of DMI, it then proposes a noise-robust loss function called L_DMI, which is theoretically justified. Experiments on synthetic and real-world noisy datasets demonstrate the effectiveness of the proposed L_DMI in defending against both diagonally dominant and diagonally non-dominant noise.

Pros:
1) The proposed information measure DMI and the robust loss function L_DMI are both theoretically justified.
2) The proposed L_DMI loss function is easy to implement.
3) Empirical results on synthetic and real-world noisy datasets show that L_DMI outperforms other baselines.

Cons:
1) The results on Fashion-MNIST show that the proposed method is not sensitive to noise patterns (i.e., class-independent and class-dependent noise) or to noise amounts (with probability from 0.1 to 0.9). However, it is unclear why Fashion-MNIST was converted to two classes instead of using the original 10-class setting. Can the authors explain this choice? To maintain consistency, the comparison with more baselines (i.e., LCCN, GCE, and FW) should also be provided.
2) For the CIFAR-10 dataset, the proposed method is only evaluated by adding noise to similar classes. The evaluation on uniform noise is missing.
3) For Dogs vs. Cats, why is the experimental setting not consistent with that of Fashion-MNIST, which is also a two-class case? Specifically, the evaluations on uniform noise and 'dog->cat' noise are missing.

Minor issue: Line 283: "neutral networks" -> "neural networks"

Overall, this paper proposes a new noise-robust loss function for defending against label noise. The proposed method is both theoretically and empirically sound.
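To make the "relative invariance" and "easy to implement" points concrete, here is a minimal sketch. It assumes the loss is the negative log of the absolute determinant of the empirical joint matrix between predicted class probabilities and noisy labels. The names `joint_matrix` and `dmi_loss`, the `eps` guard, and the example transition matrix `T` are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def joint_matrix(probs, labels, num_classes):
    """Empirical joint matrix U[c, k] = mean_i probs[i, c] * 1[labels[i] == k]."""
    one_hot = np.eye(num_classes)[labels]       # shape (N, C)
    return probs.T @ one_hot / probs.shape[0]   # shape (C, C)

def dmi_loss(probs, labels, num_classes, eps=1e-12):
    """Negative log of |det U|; eps guards against log(0) (an assumption)."""
    U = joint_matrix(probs, labels, num_classes)
    return -np.log(np.abs(np.linalg.det(U)) + eps)

# Population-level check of relative invariance under instance-independent noise.
rng = np.random.default_rng(0)
C = 3
# Hypothetical joint distribution of (predicted class, clean label).
U_clean = rng.dirichlet(np.ones(C * C)).reshape(C, C)
# Hypothetical transition matrix: T[y, k] = P(noisy label = k | clean label = y).
T = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.1, 0.6]])
# If the noisy label depends on the input only through the clean label,
# the joint with the noisy label is U_clean @ T.
U_noisy = U_clean @ T

lhs = abs(np.linalg.det(U_noisy))
rhs = abs(np.linalg.det(U_clean)) * abs(np.linalg.det(T))
```

Because `lhs` equals `rhs`, the noisy-label objective differs from the clean-label one only by the constant `-log|det T|`, which does not depend on the classifier. This holds only while `T` is invertible; uniform noise over all classes gives `det T = 0` and the argument breaks down.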

L_DMI: A Novel Information-theoretic Loss Function for Training Deep Nets Robust to Label Noise: This paper formulates a new information-theoretic loss function, which is based on determinant based mutual information (DMI). Its main contribution is that this loss is provably not sensitive to noise patterns and noise amounts.

Pros:
1. The authors find a relatively new direction for learning with noisy labels. Namely, instead of designing distance-based losses, they try an information-theoretic loss. Based on this motivation, they design the DMI loss, which is robust to label noise.
2. Related works: In deep learning with noisy labels, there are several main directions, including robust loss functions [1], the reweighting trick [2], and explicit and implicit regularization [3]. I appreciate that the authors survey them well. Note that the authors may cite [4] in the regularization line due to its high impact.
3. The authors perform numerical experiments to demonstrate the efficacy of their framework, and their experimental results support their previous claims. For example, they conduct experiments on Fashion-MNIST, CIFAR-10 and Dogs vs. Cats. Besides, they conduct experiments on the Clothing1M dataset [5].

Cons: We have two questions, as follows.
1. Settings: This paper still focuses on the class-conditional noise (CCN) model. However, the CCN model may not cover real-world noise. The currently emerging noise model is the instance-dependent noise model [6,7]. I am not sure whether this idea can be deployed in that case.
2. Experiments:
2.1 Datasets: I think the authors should run experiments on one NLP dataset instead of using only image datasets.
2.2 Baselines: Please add results from reweighting methods like MentorNet [2]. Please compare your method with VAT [4].

References:
[1] Z. Zhang and M. Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, 2018.
[2] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, 2018.
[3] H. Zhang, M. Cisse, Y.N. Dauphin, and D. Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
[4] T. Miyato, S. Maeda, M. Koyama, and S. Ishii. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. In ICLR, 2016.
[5] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang. Learning from massive noisy labeled data for image classification. In CVPR, 2015.
[6] J. Cheng, T. Liu, K. Rao, and D. Tao. Learning with bounded instance-and label-dependent label noise. arXiv 1709.03768, 2017.
[7] A. Menon, B. Rooyen, and N. Natarajan. Learning from binary labels with instance-dependent corruption. Machine Learning, 2018.