We have created two corpora for Sentiment Analysis in German and Swiss German that we make available to the NLP community for free. Please find the details below.
SB-CH: Swiss German Sentiment Corpus
SB-CH is a publicly available corpus that contains 165’916 Swiss German sentences, of which 2799 are labeled by 5 annotators with “positive”, “negative”, “neutral”, “mixed”, or “unknown”. It was created by SpinningBytes in collaboration with the Zurich University of Applied Sciences (ZHAW).
Licence
All data is provided under Creative Commons License CC BY 4.0.
This means that the data is free to use and distribute, even commercially, as long as appropriate credit is given to the reference below.
Reference
If you use the corpus, please make sure to reference the following publication:
- Towards a Corpus of Swiss German Annotated with Sentiment. Ralf Grubenmann, Don Tuggener, Pius von Däniken, Jan Deriu, Mark Cieliebak. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC 2018), to appear.
Description
A detailed description of the corpus and how it was constructed can be found in the reference above, as well as in the README file included with the corpus.
Instructions
To use the corpus, download the annotations below. Since Facebook does not allow redistribution of post content, the dataset contains only comment IDs and the corresponding annotations. A download script is provided; simply follow the README on the linked page.
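As a rough illustration, the sketch below joins the downloaded comment texts with the annotation file by comment ID. The file names and column layout are assumptions for illustration only; the actual format is documented in the corpus README.

# Hypothetical sketch: join downloaded Facebook comments with the annotations.
# File names and columns below are assumptions, not the actual corpus layout.
import csv

# Comment texts as produced by the provided download script: comment ID -> text.
comments = {}
with open("sb_ch_comments.tsv", encoding="utf-8") as f:        # assumed file name
    for comment_id, text in csv.reader(f, delimiter="\t"):
        comments[comment_id] = text

# Sentiment annotations: comment ID -> label.
labeled = []
with open("sb_ch_annotations.tsv", encoding="utf-8") as f:     # assumed file name
    for comment_id, label in csv.reader(f, delimiter="\t"):
        if comment_id in comments:
            labeled.append((comment_id, label, comments[comment_id]))

print(f"{len(labeled)} labeled Swiss German comments ready for use")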
Download
SB-10k: German Sentiment Corpus
SB-10k is a publicly available corpus that contains 9738 German tweets, each labeled by 3 annotators with “positive”, “negative”, “neutral”, “mixed”, or “unknown”. It was created by SpinningBytes in collaboration with the Zurich University of Applied Sciences (ZHAW).
Licence
All data is provided under Creative Commons License CC BY 4.0.
This means that the data is free to use and distribute, even commercially, as long as appropriate credit is given to the reference below.
Reference
If you use the corpus, please make sure to reference the following publication:
- A Twitter Corpus and Benchmark Resources for German Sentiment Analysis. Mark Cieliebak, Jan Deriu, Fatih Uzdilli, Dominic Egger. In Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2017), Valencia, Spain, 2017.
Description
A detailed description of the corpus and how it was constructed can be found in the reference above.
Instructions
To use the corpus, download the annotations below. Since Twitter does not allow redistribution of tweet content, the dataset contains only tweet IDs (first column) and the corresponding annotations (second column). A Python script to download the tweet content for these IDs can be found here*.
*On Windows, you might have to comment out the "signal.alarm(…)" calls in download_tweets_api.py to get the script to work, since the SIGALRM signal is not available on Windows.
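If you prefer to fetch the tweets yourself, the sketch below shows one possible way to do so with the tweepy library (version 3.x), assuming a tab-separated file with the tweet ID in the first column and the label in the second. The file name and the API credentials are placeholders; this is not the provided download script.

# Hypothetical sketch: fetch tweet texts for the annotated IDs with tweepy 3.x.
# File name, column layout, and credentials are assumptions for illustration.
import csv
import tweepy

# Fill in your own Twitter API credentials.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

labeled_tweets = []
with open("sb10k_annotations.tsv", encoding="utf-8") as f:     # assumed file name
    for tweet_id, label in csv.reader(f, delimiter="\t"):
        try:
            status = api.get_status(tweet_id, tweet_mode="extended")
            labeled_tweets.append((tweet_id, label, status.full_text))
        except tweepy.TweepError:
            # Tweet deleted or account protected; skip it.
            continue

print(f"Downloaded {len(labeled_tweets)} of the annotated tweets")

Note that some tweets will no longer be retrievable (deleted or protected accounts), so the number of downloaded tweets will typically be somewhat smaller than the number of annotations.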