Personalized Multi-Document Summarization using N-Gram Topic Model Fusion


We consider the problem of probabilistic topic modeling for query-focused multi-document summarization. Rather than modeling topics as distributions over a vocabulary of terms, we extend the probabilistic latent semantic analysis (PLSA) approach with a bigram language model. This allows us to relax the conditional independence assumption between words made by standard topic models. We present a unified topic model which is estimated from sentence-term and sentence-bigram co-occurrences in parallel. Sentences and queries are represented as probability distributions over latent topics to compute thematic and query-focused sentence features in the topic space. We find that the inclusion of bigrams improves the descriptive quality of the latent topics, and substantially reduces the number of latent topics required for representing document content. Experimental results on DUC 2007 data show improved performance compared to a standard term-based topic model. We further find that our method performs at the level of current state-of-the-art summarizers, while being built on a considerably simpler model than previous topic modeling approaches to summarization.
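The pipeline the abstract describes can be sketched in a few steps: fit a PLSA-style model by EM on a fused sentence-feature count matrix (term counts and bigram counts stacked side by side, a simple feature-level fusion assumption), represent each sentence as P(z|s), fold the query in as a pseudo-document with the topic-feature distributions held fixed, and score sentences against the query in topic space. The toy matrices, the feature-level fusion, and the dot-product scoring below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit asymmetric PLSA by EM on a sentence x feature count matrix.

    Returns P(z|s) with shape (n_sent, K) and P(w|z) with shape (K, n_feat).
    """
    rng = np.random.default_rng(seed)
    n_sent, n_feat = counts.shape
    p_z_s = rng.random((n_sent, n_topics)); p_z_s /= p_z_s.sum(1, keepdims=True)
    p_w_z = rng.random((n_topics, n_feat)); p_w_z /= p_w_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|s,w) ~ P(z|s) * P(w|z)
        joint = p_z_s[:, None, :] * p_w_z.T[None, :, :]   # (n_sent, n_feat, K)
        joint /= joint.sum(2, keepdims=True) + 1e-12
        # M-step: re-estimate both distributions from weighted counts
        weighted = counts[:, :, None] * joint
        p_w_z = weighted.sum(0).T
        p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_s = weighted.sum(1)
        p_z_s /= p_z_s.sum(1, keepdims=True) + 1e-12
    return p_z_s, p_w_z

def fold_in(query_counts, p_w_z, n_iter=50, seed=0):
    """Estimate P(z|q) for a query treated as a pseudo-document,
    keeping the topic-feature distributions P(w|z) fixed."""
    rng = np.random.default_rng(seed)
    n_topics = p_w_z.shape[0]
    p_z_q = rng.random((1, n_topics)); p_z_q /= p_z_q.sum()
    for _ in range(n_iter):
        joint = p_z_q[:, None, :] * p_w_z.T[None, :, :]
        joint /= joint.sum(2, keepdims=True) + 1e-12
        p_z_q = (query_counts[:, :, None] * joint).sum(1)
        p_z_q /= p_z_q.sum(1, keepdims=True) + 1e-12
    return p_z_q

# Toy data: 3 sentences over 4 term features and 2 bigram features.
term_counts = np.array([[2, 1, 0, 0], [0, 0, 2, 1], [1, 0, 1, 1]], float)
bigram_counts = np.array([[1, 0], [0, 1], [0, 0]], float)
counts = np.hstack([term_counts, bigram_counts])  # feature-level fusion (assumption)

p_z_s, p_w_z = plsa(counts, n_topics=2)
query = np.array([[1, 1, 0, 0, 1, 0]], float)     # query over the same feature space
p_z_q = fold_in(query, p_w_z)
scores = p_z_s @ p_z_q.ravel()                    # query-focused sentence scores
```

Ranking sentences by `scores` then yields the query-focused ordering; a thematic feature could analogously score each sentence against the document collection's aggregate topic distribution.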

@inproceedings{hennig2010personalized,
  title     = {Personalized Multi-Document Summarization using N-Gram Topic Model Fusion},
  author    = {Hennig, L. and Albayrak, S.},
  booktitle = {Proceedings of LREC '10, 1st Workshop on Semantic Personalized Information Management (SPIM 2010)},
  year      = {2010},
  address   = {Valletta, Malta},
  editor    = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis and Mike Rosner and Daniel Tapias},
  publisher = {European Language Resources Association (ELRA)},
  isbn      = {2-9517408-6-7},
  language  = {english},
}
Leonhard Hennig, Sahin Albayrak
Conference Paper
Int. Conf. on Language Resources and Evaluation (LREC 2010), 1st Workshop on Semantic Personalized Information Management (SPIM 2010)