CS2125 Paper Review Form - Winter 2019
Reviewer: Yasaman Rohanifar
Paper Title: DeepMutation: Mutation Testing of Deep Learning Systems
Author(s): Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, Yadong Wang

1) Is the paper technically correct?
[*] Yes
[ ] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality
[*] Very good (very novel, trailblazing work)
[ ] Good
[ ] Marginal (very incremental)
[ ] Poor (little or nothing that is new)

3) Technical Depth
[ ] Very good (comparable to best conference papers)
[*] Good (comparable to typical conference papers)
[ ] Marginal depth
[ ] Little or no depth

4) Impact/Significance
[*] Very significant
[ ] Significant
[ ] Marginal significance.
[ ] Little or no significance.

5) Presentation
[ ] Very well written
[*] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating
[ ] Strong accept (award quality)
[*] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[ ] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

This paper proposes a mutation testing framework specialized for deep learning (DL) systems to measure the quality of test data. Its contribution is to adapt traditional mutation testing to DL systems, since the traditional techniques cannot be applied directly due to fundamental differences between DL systems and traditional software. This is beneficial for evaluating the quality of the test data used in DL systems and thus improving their overall performance. Following the original ideas of mutation testing, the authors propose two mutation testing techniques: a source-level technique that manipulates the training data and training program, and a model-level technique that directly mutates the structure and parameters of DL models. They define a set of source-level mutation operators to inject faults into the sources of deep learning, design a set of model-level mutation operators that inject faults directly without retraining, and finally evaluate the quality of the test data by measuring to what extent the injected faults are detected, using two quantitative metrics (mutation score and average error rate). They evaluate the utility of their framework on 2 public datasets with 3 deep learning models, which demonstrates its usefulness in improving the quality of test data. I am not specialized in this field, but after reading this paper, I believe it should be accepted, as it introduces a new approach to improving the quality of test sets used in DL systems and further provides feedback to guide test enhancement. This can radically improve the quality and speed of current DL systems.
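To make the idea concrete for readers outside the area, below is a minimal sketch (my own illustration, not the authors' code) of what a model-level mutation operator and a mutation-score-style metric could look like. The function names (gaussian_fuzz_weights, mutation_score) and the per-mutant notion of "killing" are simplifications; the paper defines its mutation score at the granularity of output classes and provides eight distinct model-level operators.

import numpy as np

# Sketch of a model-level mutation operator in the spirit of the paper's
# Gaussian Fuzzing (GF) operator: perturb a small fraction of a weight
# matrix with Gaussian noise. Parameter names and defaults are my own.
def gaussian_fuzz_weights(weights, ratio=0.01, sigma=0.1, rng=None):
    rng = rng or np.random.default_rng()
    mutated = weights.copy()
    flat = mutated.ravel()                      # view into the copy
    k = max(1, int(ratio * flat.size))          # number of weights to mutate
    idx = rng.choice(flat.size, size=k, replace=False)
    flat[idx] += rng.normal(0.0, sigma, size=k)
    return mutated

# Simplified mutation score: the fraction of mutants "killed" by the test
# set, where a mutant is killed if it misclassifies at least one input that
# the original model classifies correctly. (The paper's actual metric is
# defined per output class; this is a per-mutant simplification.)
def mutation_score(orig_predict, mutant_predicts, x_test, y_test):
    correct = orig_predict(x_test) == y_test
    killed = sum(
        1 for predict in mutant_predicts
        if np.any((predict(x_test) != y_test) & correct)
    )
    return killed / len(mutant_predicts)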
8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1: The design of 8 mutation operators that directly manipulate deep learning models for fault injection (since training DNN models can be computationally intensive) seems like a good idea: it minimizes computation time and provides the opportunity to capture more fine-grained model-level problems.

S2: Well written and easy to read, even for people who are not specialized in deep learning systems. Good use of charts, tables, and workflows to convey the ideas. No evident technical errors were found.

S3: Comprehensive and detailed evaluation section. The authors show awareness of shortcomings and challenges and propose proper solutions (evident in the Threats to Validity section). Good explanation of the rationale behind choosing the operators.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1: Though some sections are detailed, in my opinion some parts of the paper lack sufficient explanation. For example, the reason behind the non-deterministic behavior on GPU versus the deterministic behavior on CPU in one of the scenarios mentioned in the paper is not explained.

W2: A considerable number of typos and errors.