CS2125 Paper Review Form - Winter 2019

Reviewer: Nils Wenzler
Paper Title: DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars
Author(s): Yuchi Tian, Suman Jana, Kexin Pei, Baishakhi Ray

1) Is the paper technically correct?
[ ] Yes
[X] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality
[ ] Very good (very novel, trailblazing work)
[ ] Good
[X] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth
[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance
[ ] Very significant
[ ] Significant
[X] Marginal significance.
[ ] Little or no significance.

5) Presentation
[ ] Very well written
[X] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating
[ ] Strong accept (award quality)
[X] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

In their paper, Tian et al. propose a new approach to the automated testing of deep neural networks in the context of autonomous vehicles. They propose a full process that starts with the automated generation of test cases with corresponding labels and ends with the optimization of a DNN coverage metric called neuron coverage. To generate new test cases, they apply basic image transformations to images from the original training data. To obtain fitting labels, they rely on so-called metamorphic relations, which keep the new labels close to the original ones within a margin of error. Neuron coverage is a metric that was introduced in an earlier paper of similar context, which however lacked automated test generation. For the optimization itself, they use a basic greedy search algorithm.
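To make the metamorphic-relation idea concrete, a minimal sketch in Python follows. The function name, the absolute-difference error model, and the tolerance value are my own illustrative assumptions, not the authors' exact formulation; the point is only that a transformed input's prediction is accepted as long as it stays within a margin of the prediction on the original input.

```python
# Hypothetical sketch of a metamorphic-relation check for steering angles.
# The error model (absolute difference) and tolerance are illustrative
# assumptions, not the exact formulation used by Tian et al.

def within_metamorphic_margin(original_angle: float,
                              transformed_angle: float,
                              tolerance: float = 0.1) -> bool:
    """Return True if the prediction on the transformed image stays
    within the allowed margin of the prediction on the original image."""
    return abs(transformed_angle - original_angle) <= tolerance

# A transformed frame whose prediction drifts beyond the margin would be
# flagged as erroneous behaviour.
```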
The proposed process is overall an interesting approach, but it mostly combines existing techniques, which is why I rate the originality and significance as not the highest. In general their approach is correct, although at some points they rely on "magic numbers" that are not further justified, or, as in the case of the correlation between neuron coverage and different behaviours, use very broad arguments as justification.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1: The paper is well structured and in general easy to understand. In particular, the substructure of answering the individual research questions makes the paper easier to read and scan.

S2: The paper draws heavily on existing automated testing approaches. It is always a good thing not to reinvent the wheel where possible.

S3: They test their approach on a set of three very different models and use a considerably high number of available transformations. They do not just state that they found roughly 6,000 erroneous behaviours, but offer an in-depth breakdown.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1: The paper is in some places not fully scientifically justified. Magic numbers and very basic, broad evidence are deemed sufficient by the authors to justify their approach.

W2: The greedy search algorithm, one of the original parts of the work, is not well defined in their presentation. It is unclear how the queue of transformations works and why the resulting stack can sometimes grow by more than two transformations.

W3: The generated images are assumed per se to be realistic. However, neither the synthetic fog nor the rain overlaid on a sunny scene looks realistic, and no reasoning is offered for why a translation that introduces black borders constitutes realistic input.