CS2125 Paper Review Form - Winter 2019

Reviewer: Nils Wenzler
Paper Title: DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars
Author(s): Yuchi Tian, Suman Jana, Kexin Pei, Baishakhi Ray

1) Is the paper technically correct?
[ ] Yes
[X] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality
[ ] Very good (very novel, trailblazing work)
[ ] Good
[X] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth
[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance
[ ] Very significant
[ ] Significant
[X] Marginal significance.
[ ] Little or no significance.

5) Presentation
[ ] Very well written
[X] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating
[ ] Strong accept (award quality)
[X] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

In their paper, Tian et al. propose a new approach to the automated testing of deep neural networks in the context of autonomous vehicles. They propose a full process that starts with the automated generation of test cases with corresponding labels and ends with the optimization of a DNN coverage metric called neuron coverage. To generate new test cases, they apply basic image transformations to images from the original training data. To obtain fitting labels, they rely on so-called metamorphic relations, which keep the new labels close to the original ones within a margin of error. Neuron coverage is a metric that was introduced in an earlier paper of similar context, which however lacked automated test generation. For the optimization itself, they use a basic greedy search algorithm.
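To make the metamorphic-relation idea concrete, a minimal sketch in Python follows. The function name, the absolute-difference error model, and the tolerance value are my own illustrative assumptions, not the authors' exact formulation; the point is only that a transformed input's prediction is accepted as long as it stays within a margin of the prediction on the original input.

```python
# Hypothetical sketch of a metamorphic-relation check for steering angles.
# The error model (absolute difference) and tolerance are illustrative
# assumptions, not the exact formulation used by Tian et al.

def within_metamorphic_margin(original_angle: float,
                              transformed_angle: float,
                              tolerance: float = 0.1) -> bool:
    """Return True if the prediction on the transformed image stays
    within the allowed margin of the prediction on the original image."""
    return abs(transformed_angle - original_angle) <= tolerance

# A transformed frame whose prediction drifts beyond the margin would be
# flagged as erroneous behaviour.
```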
The proposed process is overall an interesting approach, but it mostly combines existing techniques, which is why I rate the originality and significance as not the highest. In general their approach is correct, although at some points they rely on "magic numbers" that are not further justified, or, as in the case of the correlation between neuron coverage and different behaviours, use very broad arguments as justification.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1: The paper is well structured and in general easy to understand. In particular, the substructure of answering the individual research questions makes the paper easier to read and scan.

S2: The paper draws heavily on existing automated testing approaches. It is always a good thing not to reinvent the wheel where possible.

S3: They test their approach on a set of three very different models and use a considerably high number of available transformations. They do not just state that they found roughly 6,000 erroneous behaviours, but offer an in-depth breakdown.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1: The paper is in some places not fully scientifically justified. Magic numbers and very basic, broad evidence are deemed sufficient by the authors to justify their approach.

W2: The greedy search algorithm, one of the original parts of the work, is not well defined in their presentation. It is unclear how the queue of transformations works and why the resulting stack can sometimes grow by more than two transformations.

W3: The generated images are assumed per se to be realistic. However, neither the synthetic fog nor the rain overlaid on a sunny scene looks realistic, and no reasoning is offered for why a translation that introduces black borders constitutes realistic input.