CS2125 Paper Review Form - Winter 2019
Reviewer: Abdul Kawsar Tushar
Paper Title: DeepMutation: Mutation Testing of Deep Learning Systems
Author(s): Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, and Yadong Wang

1) Is the paper technically correct?

[X] Yes
[ ] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality

[X] Very good (very novel, trailblazing work)
[ ] Good
[ ] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth

[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance

[ ] Very significant
[X] Significant
[ ] Marginal significance
[ ] Little or no significance

5) Presentation

[X] Very well written
[ ] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating

[ ] Strong accept (award quality)
[X] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

DeepMutation introduces a well-established software engineering technique, mutation testing, into the deep learning domain. Specifically, it is the first attempt to assess the quality of a test data set for deep neural networks by injecting faults ("mutations") into both the model and the data. The same group of authors has been experimenting with various ways to test the robustness of deep learning systems for the past couple of years, which aligns well with the first author's background in software testing. The novel contribution of the paper is a set of techniques that inject faults both into source-level resources (training code and data) before training and into the model-level architecture after training. Additionally, the authors propose two metrics to evaluate how well the test data detects the mutations injected into the deep learning models. The paper offers some insight into how the test data could be augmented in light of these metrics, such as class-wise augmentation of the data.

However, the authors raise more questions than they answer. Where do we go after mutations are introduced into the models? What is the relation between the two metrics introduced here? Why do we observe lower (or higher) values on these metric scales? These are questions that the paper neither tries to answer nor promises to address in future work.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1. The first attempt to introduce mutation testing into deep learning systems.
S2. Proposal of a set of metrics to measure the performance of test data in mutation testing.
S3. The paper is well written.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1. Does not provide any real insight into the data gained from the experiments or into the scores on the scale of the proposed metrics.
W2. Does not try to explain the relation between the two proposed metrics.