CS2125 Paper Review Form - Winter 2019

Reviewer: Eric Langlois
Paper Title: Semantic Adversarial Deep Learning
Author(s): Tommaso Dreossi, Somesh Jha, Sanjit A. Seshia

1) Is the paper technically correct?

[X] Yes
[ ] Mostly (minor flaws, but mostly solid)
[ ] No

2) Originality

[ ] Very good (very novel, trailblazing work)
[ ] Good
[ ] Marginal (very incremental)
[X] Poor (little or nothing that is new)

3) Technical Depth

[ ] Very good (comparable to best conference papers)
[ ] Good (comparable to typical conference papers)
[X] Marginal depth
[ ] Little or no depth

4) Impact/Significance

[ ] Very significant
[ ] Significant
[X] Marginal significance
[ ] Little or no significance

5) Presentation

[ ] Very well written
[X] Generally well written
[ ] Readable
[ ] Needs considerable work
[ ] Unacceptably bad

6) Overall Rating

[ ] Strong accept (award quality)
[ ] Accept (high quality - would argue for acceptance)
[ ] Weak Accept (borderline, but lean towards acceptance)
[X] Weak Reject (not sure why this paper was published)

7) Summary of the paper's main contribution and rationale for your recommendation. (1-2 paragraphs)

The main contribution of this paper is an abstract compositional verification approach for systems involving machine-learned (ML) components. In the proposed approach, system-level constraints are used to derive a "region of uncertainty" (ROU) for an integrated ML component. The ROU then guides an ML-specific analyzer to find prediction errors, which are then checked against the system-level constraints as a whole (a sketch of my understanding of this loop appears at the end of this review). The authors perform several experiments identifying errors in ML models. The paper also includes a large background section on machine learning models and attacks.

While the proposed approach is interesting, I find that this paper lacks focus and depth. After describing the compositional verification algorithm only in abstract terms, the authors proceed to experiment with the non-novel part: the model-specific counterexample generation. The paper then moves on to an analysis of the relatively old and well-studied hinge loss function. I recommend that the authors shorten the background, remove the discussion of the hinge loss, describe the ROU generation in greater detail, and perform experiments investigating the entire compositional verification algorithm.

8) List 1-3 strengths of the paper. (1-2 sentences each, identified as S1, S2, S3.)

S1. The proposed compositional verification algorithm appears practical and potentially effective. Its description is clear and well written.

S2. The focus on semantic adversarial analysis is well motivated, and I support the authors' advocacy of semantics as a guiding principle when generating counterexamples for machine learning systems.

9) List 1-3 weaknesses of the paper (1-2 sentences each, identified as W1, W2, W3.)

W1. None of the experiments appear to involve the main proposal: the compositional verification approach. The experiments focus on finding counterexamples using (1) a simple sampling strategy (2) over a space that appears hand-designed rather than derived from system-level constraints.

W2. The analysis of the hinge loss is out of place and does not contribute to the paper. Given the long history of the hinge loss function, I find it unlikely that the content presented is particularly novel.

W3. The region-of-uncertainty generation algorithm is not sufficiently analyzed. How is the "completely-wrong classifier" defined when there are multiple possible classes?
What if the system uses the probabilities output by the model, as the authors advocate, rather than just the predicted classes? And if the system invokes the model multiple times, how is the combinatorial growth of possible predictions handled?
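
To make W1 and W3 concrete, below is a minimal sketch (in Python) of the compositional falsification loop as I understand it. All names are hypothetical (the paper does not prescribe this interface), and the ROU search is simplified to plain enumeration, which matches the simple sampling strategy used in the experiments.

    from typing import Callable, Iterable, Optional, Tuple

    # Hypothetical types for illustration; not taken from the paper.
    Input = Tuple[float, ...]  # a point in the semantic (environment) space
    Prediction = int           # a class label from the ML component

    def compositional_falsify(
        system_check: Callable[[Input, Prediction], bool],  # system-level property
        model: Callable[[Input], Prediction],               # the ML component
        region_of_uncertainty: Iterable[Input],             # ROU from system analysis
    ) -> Optional[Input]:
        """Search the ROU for a semantic counterexample.

        The ROU is assumed to have been derived by analyzing the rest of
        the system with the ML component abstracted away: inputs outside
        the ROU cannot cause a system-level violation regardless of what
        the model predicts.
        """
        for x in region_of_uncertainty:
            y = model(x)
            # A model error matters only if it violates the system-level property.
            if not system_check(x, y):
                return x  # semantic counterexample found
        return None  # no violation found among the sampled ROU points

Note that in this sketch the model returns a single class label per invocation. If the system instead consumes the model's output probabilities, or queries the model several times per run, both this interface and the ROU construction must change; the paper does not address either case.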