CS2125 Paper Review Form - Winter 2019

Reviewer: Eric Langlois
Paper Title: DeepXplore: Automated Whitebox Testing of Deep Learning Systems
Author(s): Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana

1) Is the paper technically correct?

[ ] Yes
[ ] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality

[ ] Very good (very novel, trailblazing work)
[X] Good
[ ] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth

[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance

[ ] Very significant
[X] Significant
[ ] Marginal significance
[ ] Little or no significance

5) Presentation

[ ] Very well written
[X] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating

[ ] Strong accept (award quality)
[ ] Accept (high quality - would argue for acceptance)
[X] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

This paper presents DeepXplore, the "first whitebox framework for systematically testing real-world DL systems". DeepXplore efficiently finds test inputs for an ensemble of neural networks by performing constrained gradient ascent on a joint objective: maximizing prediction disagreement within the ensemble and maximizing neuron coverage. The concept of neuron coverage is introduced as a metric for evaluating test comprehensiveness (see the sketch appended at the end of this review). In the evaluation, the approach quickly finds many inputs on which the networks of the ensemble disagree. Applications are presented, including using the generated inputs for data augmentation.

I recommend this paper for acceptance because it introduces a novel and likely useful testing technique for neural networks. Both the high-level algorithm and the concept of neuron coverage are valuable contributions that are clearly described. The work contains a diverse set of experiments, and for the most part the paper is well structured and clearly written. However, many of the low-level design decisions are questionable and poorly motivated. For example, the authors invoke the "rules" learned by a neural network to justify the coverage criterion without elaborating on whether neural network behaviour can be effectively described by rules. In addition, there are several errors or omissions throughout the paper. The paper would be substantially improved by stronger justifications and by supplying the missing information.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1 The DeepXplore concept appears to be a novel, practical, and useful approach to neural network testing.

S2 The authors experimentally test many relevant aspects of their algorithm and design: the influence of neuron coverage, the speed and efficiency of example discovery, the effect of hyperparameter changes, and the improvement from training-set augmentation.

S3 The code for the paper is released as open source (not mentioned in the paper, but it is available on GitHub).

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1 Many design decisions are unjustified or weakly justified and seem potentially suboptimal, e.g. the one-sided coverage metric, the selection of a single random neuron for the coverage objective, and the constraints imposed on the gradient where a different input parameterization could be used instead.
W2 Some information is omitted from tables and figure descriptions: what is the meaning of "#diff" in Table 12? Does Figure 10 measure train or test error?

W3 The authors attempt to justify neuron coverage by comparing it to code coverage, but code coverage is a meaningless metric for neural networks, so beating it means very little.
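Additional note: For concreteness, below is a minimal sketch of the neuron coverage metric as I understand it from the paper: a neuron counts as covered when its activation, scaled to [0, 1] within its layer, exceeds a threshold on at least one test input. The function name, the dictionary-of-arrays representation, the default threshold, and the small epsilon are my own illustrative choices, not the authors' released implementation.

    import numpy as np

    def neuron_coverage(layer_activations, threshold=0.25):
        """Fraction of neurons whose scaled activation exceeds
        `threshold` on at least one test input.

        layer_activations: dict mapping layer name -> array of shape
        (num_inputs, num_neurons_in_layer).
        """
        covered, total = 0, 0
        for acts in layer_activations.values():
            # Scale each input's activations to [0, 1] within the layer,
            # as the paper describes, so one threshold applies per layer.
            lo = acts.min(axis=1, keepdims=True)
            hi = acts.max(axis=1, keepdims=True)
            scaled = (acts - lo) / (hi - lo + 1e-8)
            # A neuron is covered if any input pushes it past the threshold.
            covered += int(np.count_nonzero(scaled.max(axis=0) > threshold))
            total += acts.shape[1]
        return covered / total

    # Example: coverage of two hypothetical dense layers over 10 inputs.
    acts = {"fc1": np.random.rand(10, 128), "fc2": np.random.rand(10, 64)}
    print(neuron_coverage(acts, threshold=0.75))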