CS2125 Paper Review Form - Winter 2019

Reviewer: Nils Wenzler
Paper Title: DeepXplore: Automated Whitebox Testing of Deep Learning Systems
Author(s): Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana

1) Is the paper technically correct?
[ ] Yes
[X] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality
[ ] Very good (very novel, trailblazing work)
[X] Good
[ ] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth
[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance
[ ] Very significant
[X] Significant
[ ] Marginal significance
[ ] Little or no significance

5) Presentation
[ ] Very well written
[X] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating
[ ] Strong accept (award quality)
[X] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

In their paper, Pei et al. introduce a first automated process for whitebox testing of DNNs. They leverage differential testing to avoid the cost of manual labeling, and they propose neuron coverage as a test coverage metric for DNNs. The combination of the two techniques shows how classical software testing concepts can be adapted into valuable DNN testing techniques.

Being a first approach, it still leaves many open questions and rests on several assumptions. Pei et al. take a first step towards answering some of them, but many questions remain unanswered. Nonetheless, this paper has inspired a series of follow-up papers that dig more deeply into the proposed approach.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1: The proposed approach to testing DNNs is very novel. In particular, the introduction of neuron coverage has inspired a series of similar papers investigating alternative coverage metrics.
S2: The evaluation uses a large number of empirical test cases from several different domains.
S3: In contrast to the first paper I reviewed, the authors investigate how the choice of activation threshold affects their results.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1: Some claims the paper makes are not further proven or justified. The most consequential example is the claim that neuron coverage is a meaningful coverage criterion in the first place.
W2: The setting of having several independently trained networks available for a majority vote is unrealistic. Furthermore, the borderline cases of the different networks might be surprisingly similar.
W3: The domain-specific constraints limit the exhaustiveness of the search space very strongly. Once again, a specification of the expected corner cases is needed ("think of what you didn't think of").
W4: The comparison of neuron coverage to code coverage for DNNs is not convincing. Nobody would ever use code coverage for a DNN since, as the authors argue themselves, it does not make sense there.
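
To make W1 and W4 concrete, the following is a minimal numpy sketch of how I understand the neuron coverage metric (a neuron is "covered" if at least one test input drives its output above a threshold t): how activations are extracted and normalised, and the example threshold, are my assumptions rather than the paper's exact implementation.

```python
import numpy as np

def neuron_coverage(activations_per_layer, threshold=0.0):
    """Fraction of neurons whose output exceeds `threshold` for at least one input.

    `activations_per_layer`: list of arrays, one per layer, each of shape
    (num_inputs, num_neurons). How activations are extracted and scaled is
    an implementation choice; this sketch simply thresholds them as given.
    """
    covered = 0
    total = 0
    for acts in activations_per_layer:
        # A neuron counts as covered if any test input pushes it above the threshold.
        covered += int(np.sum(acts.max(axis=0) > threshold))
        total += acts.shape[1]
    return covered / total

# Toy example: random "activations" for a two-layer network on 10 test inputs.
rng = np.random.default_rng(0)
layers = [rng.standard_normal((10, 16)), rng.standard_normal((10, 8))]
print(f"Neuron coverage at t=0.25: {neuron_coverage(layers, threshold=0.25):.2f}")
```

Even in this simple form it is apparent that the metric's meaning depends heavily on the threshold and on how activations are scaled, which is exactly what W1 criticises as insufficiently justified.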