Language-Independent Twitter Sentiment Analysis


Millions of tweets posted daily contain users' opinions and sentiment in a variety of languages. Sentiment classification can benefit companies by providing data for analyzing customer feedback on products or for conducting market research. To cover a larger portion of the available tweets, sentiment classifiers need to handle tweets in multiple languages. Traditional classifiers, however, are often language-specific and require considerable effort to be applied to a different language. We analyze the characteristics and feasibility of a language-independent, semi-supervised sentiment classification approach for tweets. We use emoticons as noisy labels to generate training data from a completely raw set of tweets. We train a Naive Bayes classifier on this data and evaluate it on over 10,000 tweets in 4 languages that were human-annotated using the Mechanical Turk platform. As part of our contribution, we make the sentiment evaluation dataset publicly available. We present an evaluation of the performance of classifiers for each of the 4 languages and of the effects of using multilingual classifiers on tweets of mixed languages. Our experiments show that the classification approach can be applied effectively to multiple languages without requiring extra effort per additional language.
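
The approach described above (emoticons as noisy labels, then a Naive Bayes classifier trained on the resulting data) can be illustrated with a minimal sketch. It is not the authors' implementation: the emoticon sets, the whitespace tokenization, the add-one smoothing, and the function names `noisy_label`, `train`, and `classify` are all assumptions made for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict

# Assumed emoticon sets; the abstract does not list the ones actually used.
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", "=("}


def noisy_label(tweet):
    """Return (words, label), or None if emoticons are absent or conflicting."""
    tokens = tweet.split()  # whitespace tokenization, no language-specific rules
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # neither or both kinds present: skip the tweet
        return None
    # Drop the emoticons so the classifier cannot simply memorize the label.
    words = [t.lower() for t in tokens if t not in POSITIVE and t not in NEGATIVE]
    return words, ("positive" if has_pos else "negative")


def train(tweets):
    """Collect per-class word counts and class counts from raw tweets."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for tweet in tweets:
        labelled = noisy_label(tweet)
        if labelled is None:
            continue
        words, label = labelled
        class_counts[label] += 1
        word_counts[label].update(words)
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Multinomial Naive Bayes with add-one smoothing (an assumed choice)."""
    vocab = set().union(*word_counts.values())
    vocab_size = len(vocab) + 1  # +1 reserves a slot for unseen words
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total)  # log prior
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy mixed-language input, invented for this example only.
tweets = [
    "I love this phone :)", "great day :D", "ich liebe das :)",
    "I hate this phone :(", "terrible day :(", "das ist schlecht :-(",
    "no emoticon here", "mixed :) :(",
]
word_counts, class_counts = train(tweets)
```

Because the sketch only splits on whitespace and counts tokens, nothing in it depends on the language of the tweets; a single model can be trained on a mixed-language set or one model can be trained per language.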

Sascha Narr, Michael Hülfenhaus, Sahin Albayrak
KDML, LWA 2012, Dortmund, Germany