Tailoring Taxonomies for Efficient Text Categorization and Expert Finding

Abstract

Content categorization by means of taxonomies is a powerful tool for information retrieval and search technologies. It improves the accessibility of data for both humans and machines, and applications of automatic data characterization can be found all over the Web. While research on automatic categorization has mainly focused on the problem of classifier design, hardly any effort has been spent on determining how many categories are actually necessary for successful classification. Given that modern retrieval systems are based on taxonomies of tens of thousands of categories, this question is important because answering it will help accelerate data access. In this paper, we demonstrate empirically that even small subtrees of a taxonomy often enable reliable categorization. We compare several measures for the selection of category subtrees and investigate to what extent the reduction affects classification quality. We consider applications in classical document categorization and in the emerging area of expert finding, and report corresponding results obtained from experiments with standard benchmark data.

@INPROCEEDINGS{wetzker08b,
title={Tailoring Taxonomies for Efficient Text Categorization and Expert Finding},
author={Wetzker, R. and Umbrath, W. and Hennig, L. and Bauckhage, C. and Alpcan, T. and Metze, F.},
booktitle={Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on},
year={2008},
month={Dec.},
volume={3},
pages={459--462},
abstract={Automatic content categorization by means of taxonomies is a powerful tool for information retrieval and search technologies as it improves the accessibility of data both for humans and machines. While research on automatic categorization has mainly focused on the problem of classifier design, hardly any effort has been spent on the optimization of the taxonomy size itself. However, taxonomy tailoring may significantly improve computational efficiency and scalability of modern retrieval systems where taxonomies often consist of tens of thousands of non-uniformly distributed categories. In this paper we demonstrate empirically that small subtrees of a taxonomy already enable reliable categorization. We compare several measures for the optimal selection of sub-taxonomies and investigate to what extent a reduction affects the classification quality. We consider applications in classical document categorization and in the upcoming area of expert finding and report corresponding results obtained from experiments with standard benchmark data.},
keywords={content management, information retrieval, text analysis, automatic content categorization, classical document categorization, expert finding, information retrieval, search technology, text categorization},
doi={10.1109/WIIAT.2008.179},
}
Authors:
Robert Wetzker, Winfried Umbrath, Leonhard Hennig, Christian Bauckhage, Tansu Alpcan, Florian Metze
Category:
Conference Paper
Year:
2008
Location:
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology