Automated Ground Truth Data Generation for Newspaper Document Images
Abstract
In document image understanding, public ground-truthed datasets are an important part of scientific work. They are not only helpful for developing new methods, but also provide a common reference point that allows the performance of different methods to be compared without reimplementing them. Several datasets exist for document image understanding, each with its own pros and cons. Generating these datasets is time-consuming and costly, and therefore each existing and new dataset is valuable. In this paper we propose a way to generate a ground-truthed dataset for newspapers. The ground truth in focus is layout analysis ground truth. The proposed two-step approach consists of a layout generation module and an image matching module that transfers the ground truth information from the synthetic data to the scanned version. In the first step, the "MyNews" system generates newspaper layouts from a news corpus. The output consists of a digital newspaper (PDF file) and an XML file containing geometric and logical layout information. In the second step, the PDF files are printed and scanned, and each scanned document image is aligned with the synthetic image obtained by rendering the PDF. Finally, the geometric and logical layout ground truth is mapped onto the scanned image.
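The final mapping step can be illustrated with a minimal sketch. It assumes, purely for illustration, that the alignment between the rendered PDF image and the scan is available as a 3x3 projective transform (homography); the abstract does not state which transformation model is used. The names `apply_homography` and `map_region` are hypothetical and not part of the described system.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom)
Matrix = Sequence[Sequence[float]]       # 3x3 homography, row-major


def apply_homography(h: Matrix, point: Point) -> Point:
    """Map one point from synthetic-image to scan coordinates."""
    x, y = point
    # Homogeneous scale factor; it is 1 for purely affine transforms.
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point is mapped to infinity by this transform")
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return (u, v)


def map_region(h: Matrix, box: Box) -> List[Point]:
    """Map a rectangular ground-truth region onto the scanned image.

    The result is a four-point polygon, because a rectangle generally
    does not stay axis-aligned after a projective transform.
    """
    left, top, right, bottom = box
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    return [apply_homography(h, corner) for corner in corners]


# Assumed example transform: scale by 2, then shift by (10, 5).
example_h = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 5.0],
    [0.0, 0.0, 1.0],
]
example_polygon = map_region(example_h, (0.0, 0.0, 100.0, 50.0))
```

In such a setup, each region read from the XML layout file would be passed through `map_region`, so that the logical label attached to the region carries over unchanged while only its geometry is transformed.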