Identifying Sentence-Level Semantic Content Units with Topic Models
Abstract
Statistical approaches to document content modelling typically focus either on broad topics or on discourse-level subtopics of a text. We analyse the performance of probabilistic topic models on the task of learning sentence-level topics that resemble facts. Identifying sentential content with the same meaning is an important task in multi-document summarization and in the evaluation of multi-document summaries. In our approach, each sentence is represented as a distribution over topics, and each topic as a distribution over words. We compare the topic-sentence assignments learnt by a topic model to gold-standard assignments that were manually annotated on a set of closely related pairs of news articles. We observe a clear correspondence between automatically identified and annotated topics. The high accuracy of the automatically derived topic-sentence assignments suggests that topic models can be used to identify (sub-)sentential semantic content units.
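The representation described in the abstract (each sentence a distribution over topics, each topic a distribution over words) can be sketched with a minimal collapsed Gibbs sampler for an LDA-style model, treating each sentence as its own document. This is an illustrative sketch only: the toy sentences, topic count, hyperparameters, and iteration count below are assumptions for demonstration, not the paper's data or configuration.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; each doc is a list of word tokens."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}

    # Count tables: document-topic, topic-word, and topic totals.
    ndk = [[0] * n_topics for _ in docs]
    nkw = [[0] * V for _ in range(n_topics)]
    nk = [0] * n_topics

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            ndk[d][k] += 1
            nkw[k][widx[w]] += 1
            nk[k] += 1
        z.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][widx[w]] -= 1
                nk[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][widx[w]] + beta) / (nk[t] + V * beta)
                    for t in range(n_topics)
                ]
                r = rng.random() * sum(weights)
                acc = 0.0
                for t, wt in enumerate(weights):
                    acc += wt
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][widx[w]] += 1
                nk[k] += 1

    # Per-sentence topic distributions (smoothed, normalised counts).
    return [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]


# Toy "closely related sentence" pairs; sentences 0/1 and 2/3 share content
# and should tend to concentrate on the same dominant topic.
sentences = [
    "quake struck city buildings collapsed".split(),
    "earthquake damaged buildings city".split(),
    "rescue teams searched survivors".split(),
    "teams searched rubble survivors rescue".split(),
]
theta = lda_gibbs(sentences, n_topics=2)
```

In this framing, sentences with the same semantic content receive similar topic distributions, so the argmax topic of each sentence serves as a rough content-unit label that can be compared against gold-standard annotations.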