CS2125 Paper Review Form - Winter 2019

Reviewer: Eric Langlois
Paper Title: DeepXplore: Automated Whitebox Testing of Deep Learning Systems
Author(s): Kexin Pei, Yinzhi Cao, Junfeng Yang, Suman Jana

1) Is the paper technically correct?

[ ] Yes
[ ] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality

[ ] Very good (very novel, trailblazing work)
[X] Good
[ ] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth

[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance

[ ] Very significant
[X] Significant
[ ] Marginal significance
[ ] Little or no significance

5) Presentation

[ ] Very well written
[X] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating

[ ] Strong accept (award quality)
[ ] Accept (high quality - would argue for acceptance)
[X] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

This paper presents DeepXplore, the "first whitebox framework for systematically testing real-world DL systems". DeepXplore efficiently finds test inputs for an ensemble of neural networks by performing constrained gradient ascent on a joint objective: maximizing prediction disagreement within the ensemble and maximizing neuron coverage. The concept of neuron coverage is introduced as a metric for evaluating test comprehensiveness (see the sketch appended at the end of this review). In the evaluation, the approach quickly finds many inputs on which the networks of the ensemble disagree. Applications are presented, including using the generated inputs for data augmentation.

I recommend this paper for acceptance because it introduces a novel and likely useful testing technique for neural networks. Both the high-level algorithm and the concept of neuron coverage are valuable contributions that are clearly described. The work contains a diverse set of experiments, and for the most part the paper is well structured and clearly written. However, many of the low-level design decisions are questionable and poorly motivated. For example, the authors invoke the "rules" learned by a neural network to justify the coverage criterion without elaborating on whether neural network behaviour can be effectively described by rules. In addition, there are several errors or omissions throughout the paper. The paper would be substantially improved by stronger justifications and by supplying the missing information.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1 The DeepXplore concept appears to be a novel, practical, and useful approach to neural network testing.

S2 The authors experimentally test many relevant aspects of their algorithm and design: the influence of neuron coverage, the speed and efficiency of example discovery, the effect of hyperparameter changes, and the improvement from training-set augmentation.

S3 The code for the paper is released as open source (not mentioned in the paper, but it is available on GitHub).

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1 Many design decisions are unjustified or weakly justified and seem potentially suboptimal, e.g. the one-sided coverage metric, the selection of a single random neuron for the coverage objective, and the constraints imposed on the gradient where a different input parameterization could be used instead.
W2 Some information is omitted from tables and figure descriptions: what is the meaning of "#diff" in Table 12? Does Figure 10 measure train or test error?

W3 The authors attempt to justify neuron coverage by comparing it to code coverage, but code coverage is a meaningless metric for neural networks, so beating it means very little.
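Additional note: For concreteness, below is a minimal sketch of the neuron coverage metric as I understand it from the paper: a neuron counts as covered when its activation, scaled to [0, 1] within its layer, exceeds a threshold on at least one test input. The function name, the dictionary-of-arrays representation, the default threshold, and the small epsilon are my own illustrative choices, not the authors' released implementation.

    import numpy as np

    def neuron_coverage(layer_activations, threshold=0.25):
        """Fraction of neurons whose scaled activation exceeds
        `threshold` on at least one test input.

        layer_activations: dict mapping layer name -> array of shape
        (num_inputs, num_neurons_in_layer).
        """
        covered, total = 0, 0
        for acts in layer_activations.values():
            # Scale each input's activations to [0, 1] within the layer,
            # as the paper describes, so one threshold applies per layer.
            lo = acts.min(axis=1, keepdims=True)
            hi = acts.max(axis=1, keepdims=True)
            scaled = (acts - lo) / (hi - lo + 1e-8)
            # A neuron is covered if any input pushes it past the threshold.
            covered += int(np.count_nonzero(scaled.max(axis=0) > threshold))
            total += acts.shape[1]
        return covered / total

    # Example: coverage of two hypothetical dense layers over 10 inputs.
    acts = {"fc1": np.random.rand(10, 128), "fc2": np.random.rand(10, 64)}
    print(neuron_coverage(acts, threshold=0.75))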