CS2125 Paper Review Form - Winter 2019
Reviewer: Hazem Ibrahim
Paper Title: DeepMutation: Mutation Testing of Deep Learning Systems
Author(s): Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, Yadong Wang

1) Is the paper technically correct?
[X] Yes
[ ] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality
[X] Very good (very novel, trailblazing work)
[ ] Good
[ ] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth
[ ] Very good (comparable to best conference papers)
[X] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance
[ ] Very significant
[X] Significant
[ ] Marginal significance.
[ ] Little or no significance.

5) Presentation
[ ] Very well written
[ ] Generally well written
[X] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating
[X] Strong accept (award quality)
[ ] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)
In this paper, the authors propose a mutation testing framework designed specifically for deep learning systems. The framework uses two types of mutations. First, source-level mutations are applied either to the training data or to the layers of the training program. Second, model-level mutations are applied directly to the trained model. Both types are used to evaluate the effectiveness of the test dataset and to locate its weaknesses. The authors introduce two metrics for this evaluation. The first is the mutation score: the ratio of the number of mutated models whose classification differs from that of the original model to the total number of mutated models.
The second metric is the average error rate: the sum of the mutated models' error rates on a given test set, divided by the total number of mutated models. The results reported in the paper are encouraging, with high mutation scores for the models tested. I would argue for the acceptance of this paper despite its flaws, described in the weaknesses section, as it provides an interesting and new approach to evaluating test sets for Deep Learning systems.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)
S1. The explanation accompanying each mutation is informative and makes clear why that mutation was selected.
S2. The diagrams provided to explain the process of mutation testing are informative.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)
W1. The paper contains many grammatical errors.
W2. Several sections repeat information unnecessarily (Section 2B and Section 3A, for example).
W3. Related work should appear near the start of the paper rather than at the end, to provide context for the proposal that follows.
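
Reviewer's illustration for item 7: the two metrics described in the summary can be sketched in a few lines of Python. This is a minimal sketch of the metrics as paraphrased in this review, not the paper's formal definitions, which may differ in detail (for example, in how a "killed" mutant is counted). The names Model, TestSet, is_killed, and threshold_model, and the toy threshold classifiers, are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical types for illustration only:
Model = Callable[[float], int]           # maps one input to a class label
TestSet = Sequence[Tuple[float, int]]    # (input, expected label) pairs


def is_killed(original: Model, mutant: Model, tests: TestSet) -> bool:
    """Assumption: a mutant is 'killed' if at least one test input
    receives a different label from the mutant than from the original."""
    return any(mutant(x) != original(x) for x, _ in tests)


def mutation_score(original: Model, mutants: Sequence[Model], tests: TestSet) -> float:
    """Killed mutants divided by the total number of mutants."""
    killed = sum(is_killed(original, m, tests) for m in mutants)
    return killed / len(mutants)


def error_rate(model: Model, tests: TestSet) -> float:
    """Fraction of test inputs the model labels incorrectly."""
    wrong = sum(model(x) != y for x, y in tests)
    return wrong / len(tests)


def average_error_rate(mutants: Sequence[Model], tests: TestSet) -> float:
    """Sum of the mutants' error rates divided by the number of mutants."""
    return sum(error_rate(m, tests) for m in mutants) / len(mutants)


def threshold_model(t: float) -> Model:
    """Toy stand-in for a classifier: label 1 if the input is at least t."""
    return lambda x: int(x >= t)


original = threshold_model(0.5)
# Toy "mutants": one identical to the original, two with shifted thresholds.
mutants = [threshold_model(0.5), threshold_model(0.3), threshold_model(0.9)]
tests = [(0.1, 0), (0.4, 0), (0.6, 1), (0.8, 1)]

print(mutation_score(original, mutants, tests))   # 2 of 3 mutants are killed
print(average_error_rate(mutants, tests))         # (0 + 0.25 + 0.5) / 3 = 0.25
```

In this toy setup the unchanged mutant survives, which shows why a higher mutation score suggests a test set that is better at exposing behavioural differences between the original and mutated models.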