An unsupervised hierarchical method for automated document categorization

Abstract

We propose a hierarchical approach to document categorization that requires no pre-configuration and maps the semantic document space to a predefined taxonomy. Because we use search engines to train our hierarchical classifier, our approach is more flexible than existing solutions that rely on human-labeled data and are bound to a specific domain. Moreover, we show that the structural information given by the taxonomy allows for context-aware construction of search queries and leads to higher tagging accuracy. We test our approach on different benchmark datasets and evaluate its performance on the single- and multi-tag assignment tasks. The experimental results show that our solution is as accurate as supervised classifiers for web page classification and still performs well when categorizing domain-specific documents.
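
The abstract does not spell out how the taxonomy shapes the search queries. As a rough illustration only, the sketch below shows one way a context-aware query could be built: the label of a category is combined with the labels of its ancestors, so that an ambiguous label is disambiguated by its position in the taxonomy. The toy taxonomy, the function names `path_to_root` and `build_query`, the `context_depth` parameter, and the quoting scheme are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child label -> parent label (None marks a root).
PARENT: Dict[str, Optional[str]] = {
    "Science": None,
    "Sports": None,
    "Physics": "Science",
    "Optics": "Physics",
    "Tennis": "Sports",
}


def path_to_root(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the labels on the path from the root down to `label`."""
    path: List[str] = []
    node: Optional[str] = label
    while node is not None:
        path.append(node)
        node = parent[node]
    return list(reversed(path))


def build_query(
    label: str,
    parent: Dict[str, Optional[str]],
    context_depth: int = 1,
) -> str:
    """Build a query string from a label plus its nearest ancestors.

    Assumed scheme: the category label comes first, followed by up to
    `context_depth` ancestor labels (nearest first), each in quotes.
    """
    ancestors = path_to_root(label, parent)[:-1]
    # Guard the zero case explicitly, since a [-0:] slice keeps everything.
    context = ancestors[-context_depth:] if context_depth > 0 else []
    terms = [label] + list(reversed(context))
    return " ".join(f'"{term}"' for term in terms)


# One query per taxonomy node; the results returned by a search engine for
# each query could then serve as training documents for that node.
queries = {label: build_query(label, PARENT) for label in PARENT}
```

Under these assumptions, `build_query("Optics", PARENT)` yields `"Optics" "Physics"`, while a root category such as `Science` is queried by its label alone. Raising `context_depth` adds more ancestors, which trades recall for less ambiguous results.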

@INPROCEEDINGS{wetzker07b,
  author = {Robert Wetzker and Tansu Alpcan and Christian Bauckhage and Winfried Umbrath and Sahin Albayrak},
  title = {An unsupervised hierarchical method for automated document categorization},
  booktitle = {Proceedings of the IEEE/WIC/ACM Web Intelligence 2007},
  year = {2007},
  publisher = {IEEE Computer Society Press},
  note = {to appear},
  owner = {wetzker},
  timestamp = {2007.04.05}
}
Authors:
Robert Wetzker, Tansu Alpcan, Christian Bauckhage, Winfried Umbrath, Sahin Albayrak
Category:
Conference Paper
Year:
2007
Location:
Proceedings of the IEEE/WIC/ACM Web Intelligence 2007, Silicon Valley, USA