Title: Abstractive Text Summarization

Company: Microsoft

[Image: NLP illustration]

As a research intern on the Knowledge Mining team at Microsoft Research in 2020, I contributed to Project Cortex. My primary responsibility was to address the challenge of producing abstractive summaries from bullet points, a task integral to the broader goal of showing detailed information about an acronym when a user hovers over it in a document. While my colleagues focused on implementing the full form and one-line descriptions, my role was to explore the feasibility of generating multi-line descriptions using state-of-the-art seq2seq models.

A multi-line description meant summarizing multiple definitions into a single coherent paragraph. The use case posed a distinctive challenge: the input text was brief, arriving as bullet points, so the limited information had to be compiled into a cohesive summary without repetition. The output was expected to be grammatically correct, logically structured, and non-repetitive.
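To make the task concrete, here is a minimal sketch of the kind of generation step involved, using an off-the-shelf BART summarization checkpoint from the Hugging Face transformers library. The checkpoint name and the bullet points are illustrative placeholders, not the actual Project Cortex data or pipeline.

```python
from transformers import pipeline

# Illustrative bullet-point definitions for a hypothetical acronym "KM".
bullets = [
    "KM stands for Knowledge Mining.",
    "KM extracts structured information from unstructured documents.",
    "KM powers features such as acronym definitions shown on hover.",
]

# Off-the-shelf abstractive summarizer (assumed checkpoint: facebook/bart-large-cnn).
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Concatenate the bullets and ask the model for a short multi-line description.
result = summarizer(" ".join(bullets), max_length=60, min_length=20, do_sample=False)
print(result[0]["summary_text"])
```

On inputs this short, a vanilla summarization checkpoint tends to behave extractively, which is exactly the behaviour discussed in the results below.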

Another hurdle was the lack of a training dataset. I investigated several existing datasets, including DiscoFuse, and curated a custom dataset tailored to the specific requirements of our task. The objective was to fine-tune and compare the performance of candidate models, namely BERT, GPT-2, and BART.
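As a rough illustration of the fine-tuning setup, the sketch below trains a BART model on a DiscoFuse-style sentence-fusion pair with plain PyTorch. The checkpoint name, the example pair, and the hyperparameters are assumptions for illustration only; the actual experiments used the curated dataset and a fuller training loop.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

# Hypothetical DiscoFuse-style pair: two disjoint sentences -> one fused sentence.
pairs = [
    (
        "ML stands for machine learning. ML is used to build predictive models.",
        "ML, which stands for machine learning, is used to build predictive models.",
    ),
]

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

model.train()
for source, target in pairs:
    inputs = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy against the fused sentence
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```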

Among the models explored, the un-tuned BART model stood out, delivering the most logical and grammatically correct sentences, although its output was more extractive in nature. The BERT model fine-tuned on DiscoFuse, on the other hand, showed a higher level of abstraction but struggled to retain facts, producing repetitions that hurt the legibility of the summary. Both BART and BERT achieved similar ROUGE scores, indicating substantial coverage of the content and definitions. Notably, a fine-tuned BART experiment was expected to yield further improvements, given the gains observed from fine-tuning BERT, which is a partial module of the BART model.
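For reference, this is roughly how such a ROUGE comparison could be computed with the rouge_score package; the reference and candidate summaries here are made-up stand-ins for the actual evaluation data.

```python
from rouge_score import rouge_scorer

# Made-up reference summary and model output, standing in for the real evaluation data.
reference = "KM, short for Knowledge Mining, extracts structured information from documents."
candidate = "KM stands for Knowledge Mining and extracts structured information from documents."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, candidate)
for name, score in scores.items():
    print(f"{name}: P={score.precision:.3f} R={score.recall:.3f} F1={score.fmeasure:.3f}")
```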

In drawing conclusions from the experiments, it became evident that the multi-line definition task could benefit from an initial focus on multi-document abstractive summarization and sentence fusion. This strategic shift, from starting with individual definitions to fusing them, was proposed as a more promising path toward a production-ready feature.

In summary, my work involved an in-depth exploration of seq2seq models, the creation and fine-tuning of datasets, and the evaluation of model performances, leading to valuable insights for the ongoing development of Project Cortex.