Personalized Multi-Document Summarization using N-Gram Topic Model Fusion
Abstract
We consider the problem of probabilistic topic modeling for query-focused multi-document summarization. Rather than modeling topics as distributions over a vocabulary of terms, we extend the probabilistic latent semantic analysis (PLSA) approach with a bigram language model. This allows us to relax the conditional independence assumption between words made by standard topic models. We present a unified topic model that is estimated from sentence-term and sentence-bigram co-occurrences in parallel. Sentences and queries are represented as probability distributions over latent topics to compute thematic and query-focused sentence features in the topic space. We find that the inclusion of bigrams improves the descriptive quality of the latent topics and substantially reduces the number of latent topics required to represent document content. Experimental results on DUC 2007 data show improved performance compared to a standard term-based topic model. We further find that our method performs at the level of current state-of-the-art summarizers, while being built on a considerably simpler model than previous topic modeling approaches to summarization.
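The fusion idea described above can be sketched as a PLSA-style EM procedure in which a single sentence-topic distribution is re-estimated from both a sentence-term and a sentence-bigram count matrix. The following is a minimal NumPy sketch under assumed conventions (function name, update schedule, and uniform weighting of the two evidence sources are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def fused_plsa(n_term, n_bigram, k, iters=50, seed=0):
    """Illustrative PLSA variant: a shared sentence-topic distribution
    P(z|s) is estimated from term and bigram co-occurrence counts
    in parallel. Rows of the count matrices index sentences."""
    rng = np.random.default_rng(seed)
    s = n_term.shape[0]
    # Random initialization of the component distributions.
    p_z_s = rng.random((s, k)); p_z_s /= p_z_s.sum(1, keepdims=True)
    p_t_z = rng.random((k, n_term.shape[1])); p_t_z /= p_t_z.sum(1, keepdims=True)
    p_b_z = rng.random((k, n_bigram.shape[1])); p_b_z /= p_b_z.sum(1, keepdims=True)
    for _ in range(iters):
        acc_z_s = np.zeros_like(p_z_s)
        for counts, p_w_z in ((n_term, p_t_z), (n_bigram, p_b_z)):
            # E-step: responsibilities P(z | s, w), shape (s, k, V).
            joint = p_z_s[:, :, None] * p_w_z[None, :, :]
            post = joint / joint.sum(1, keepdims=True).clip(min=1e-12)
            weighted = counts[:, None, :] * post
            # M-step for the topic-word (or topic-bigram) distribution.
            p_w_z[:] = weighted.sum(0)
            p_w_z /= p_w_z.sum(1, keepdims=True).clip(min=1e-12)
            # Accumulate evidence for the shared sentence-topic distribution.
            acc_z_s += weighted.sum(2)
        # M-step for P(z|s), pooling term and bigram evidence.
        p_z_s = acc_z_s / acc_z_s.sum(1, keepdims=True).clip(min=1e-12)
    return p_z_s, p_t_z, p_b_z
```

A query can then be folded into the same topic space (e.g. by fixing the topic-word distributions and running the E/M updates for its counts alone), so that query-sentence similarity is computed between distributions over latent topics rather than over raw terms.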