Identifying Sentence-Level Semantic Content Units with Topic Models
Abstract
Statistical approaches to document content modelling typically focus either on broad topics or on discourse-level subtopics of a text. We analyse the performance of probabilistic topic models on the task of learning sentence-level topics that resemble facts. Identifying sentential content with the same meaning is an important task in multi-document summarization and in the evaluation of multi-document summaries. In our approach, each sentence is represented as a distribution over topics, and each topic as a distribution over words. We compare the topic-sentence assignments learnt by a topic model to gold-standard assignments that were manually annotated on a set of closely related pairs of news articles. We observe a clear correspondence between automatically identified and annotated topics. The high accuracy of the automatically derived topic-sentence assignments suggests that topic models can be used to identify (sub-)sentential semantic content units.
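The representation described in the abstract (each sentence a distribution over topics, each topic a distribution over words) can be sketched with a minimal collapsed Gibbs sampler for an LDA-style model, treating each sentence as its own document. This is an illustrative sketch only: the toy sentences, topic count, hyperparameters, and iteration count below are assumptions for demonstration, not the paper's data or configuration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of word tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}

    # Count tables: document-topic, topic-word, and topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * V for _ in range(n_topics)]
    nk = [0] * n_topics

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][widx[w]] += 1
            nk[k] += 1
        z.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][widx[w]] -= 1
                nk[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][widx[w]] + beta) / (nk[t] + V * beta)
                    for t in range(n_topics)
                ]
                r = rng.random() * sum(weights)
                acc = 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][widx[w]] += 1
                nk[k] += 1

    # Per-sentence topic distributions (smoothed, normalised counts).
    return [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy "closely related sentence" pairs; sentences 0/1 and 2/3 share content
# and should tend to concentrate on the same dominant topic.
sentences = [
    "quake struck city buildings collapsed".split(),
    "earthquake damaged buildings city".split(),
    "rescue teams searched survivors".split(),
    "teams searched rubble survivors rescue".split(),
]
theta = lda_gibbs(sentences, n_topics=2)
```

In this framing, sentences with the same semantic content receive similar topic distributions, so the argmax topic of each sentence serves as a rough content-unit label that can be compared against gold-standard annotations.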