SPIGA – A Multilingual News Aggregator
Abstract
News aggregation websites collect and group news articles from a multitude of sources to help users navigate and consume large amounts of news material. In this context, Topic Detection and Tracking (TDT) methods address two challenges: identifying new events in streams of news articles, and threading together related articles. We propose a novel model for a multilingual news aggregator that groups together news articles written in different languages, and thus allows users to get an overview of important events and their reception in different countries. Our model combines a vector space representation of documents, based on a multilingual lexicon of Wikipedia-derived concepts, with named entity disambiguation and multilingual clustering methods for TDT. We describe an implementation of our approach on a large-scale, real-life data stream of English and German newswire sources, and present an evaluation of the named entity disambiguation module, which achieves state-of-the-art performance on a German and an English evaluation dataset.
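To illustrate the intuition behind a concept-based vector space representation, the following minimal Python sketch maps an English and a German text onto shared, language-independent concept identifiers and compares them with cosine similarity. The lexicon entries, the concept identifiers, and the function names are hypothetical and chosen for illustration only; they are not taken from the system described here, which additionally relies on named entity disambiguation and clustering.

```python
import math
from collections import Counter

# Hypothetical toy lexicon: language-specific surface terms are mapped to
# shared concept IDs (a real lexicon would be derived from Wikipedia).
LEXICON = {
    "en": {"chancellor": "C_CHANCELLOR", "election": "C_ELECTION", "berlin": "C_BERLIN"},
    "de": {"kanzlerin": "C_CHANCELLOR", "wahl": "C_ELECTION", "berlin": "C_BERLIN"},
}

def concept_vector(text, lang):
    """Count the concepts mentioned in a text, ignoring terms not in the lexicon."""
    terms = LEXICON[lang]
    return Counter(terms[tok] for tok in text.lower().split() if tok in terms)

def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

en_doc = concept_vector("Chancellor wins election in Berlin", "en")
de_doc = concept_vector("Kanzlerin gewinnt Wahl in Berlin", "de")
other_doc = concept_vector("Football results", "en")

same_event = cosine(en_doc, de_doc)      # both texts share all three concepts
unrelated = cosine(en_doc, other_doc)    # no shared concepts
```

Because both texts are reduced to the same concept space, articles in different languages become directly comparable without translating either one, which is the property a multilingual clustering step can build on.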