Deep Neural Solution to Document Level Novelty Detection

Abstract

The rapid growth of documents across the web has necessitated finding means of discarding redundant documents and retaining novel ones. Capturing redundancy is challenging as it may involve investigating at a deep semantic level. Techniques for detecting such semantic redundancy at the document level are scarce. In this work, we propose a deep Convolutional Neural Network (CNN) based model to classify a document as novel or redundant with respect to a set of relevant documents already seen by the system. The system is simple and does not require manual feature engineering. Our novel scheme encodes relevant and relative information from both source and target texts to generate an intermediate representation for which we coin the name Relative Document Vector (RDV). The proposed method outperforms the existing benchmark on two document-level novelty detection datasets by a margin of ∼5% in terms of accuracy. We further demonstrate the effectiveness of our approach on a standard paraphrase detection dataset where the paraphrased passages closely resemble semantically redundant documents.
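To make the idea of a "relative" representation more concrete, here is a minimal sketch of how a target document could be encoded against its source documents before being passed to a CNN. It assumes sentence embeddings are already available as plain lists of floats. The nearest-sentence pairing by cosine similarity and the particular feature combination (target vector, matched source vector, absolute difference, element-wise product) are assumptions of this sketch, not details stated in the abstract. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def relative_document_matrix(target_sents, source_sents):
    """Build one row per target sentence.

    Assumption of this sketch: each row concatenates the target sentence
    vector t, its most similar source sentence vector s, |t - s| and t * s.
    """
    if not source_sents:
        raise ValueError("need at least one source sentence vector")
    rows = []
    for t in target_sents:
        # Pick the source sentence closest to this target sentence.
        s = max(source_sents, key=lambda cand: cosine(t, cand))
        diff = [abs(a - b) for a, b in zip(t, s)]
        prod = [a * b for a, b in zip(t, s)]
        rows.append(list(t) + list(s) + diff + prod)
    return rows


# Toy 2-dimensional "sentence embeddings" (made-up numbers, for illustration only).
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.9, 0.1], [0.5, 0.5]]
rdv = relative_document_matrix(target, source)
```

Each row of `rdv` has four times the embedding dimension, and the matrix has one row per target sentence. A matrix of this shape is the kind of input a convolutional classifier could consume to predict novel versus redundant.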

Images

RDV-CNN architecture

Full paper

Click here to see the full paper, which was published at the COLING 2018 conference

GitHub repository

Click here to see the code for replicating the results in the paper

Convolutional Neural Network to filter out semantically redundant documents with respect to a set of relevant documents
