Word Embeddings

Word embeddings map words to vector representations and are a core building block in Natural Language Processing (NLP). They are used, for instance, in deep learning models for named entity recognition, sentiment analysis, or chatbots.
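In such a representation, semantically similar words such as “good” and “great” are mapped to nearby vectors, which lets models generalize across related words.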

Licence

All word embeddings are provided under the Creative Commons licence CC BY 4.0.
This means that they are free to use and distribute, even commercially, as long as appropriate credit is given to at least one of the publications referenced below.
Human-readable format: Link
Licence Contract: Link

Reference

If you use any of the word embeddings, please make sure to reference at least one of the following publications:

  • A Twitter Corpus and Benchmark Resources for German Sentiment Analysis, by Mark Cieliebak, Jan Deriu, Fatih Uzdilli, and Dominic Egger. In “Proceedings of the 4th International Workshop on Natural Language Processing for Social Media (SocialNLP 2017)”, Valencia, Spain, 2017.
  • Leveraging Large Amounts of Weakly Supervised Data for Multi-Language Sentiment Classification, by Jan Deriu, Aurelien Lucchi, Valeria De Luca, Aliaksei Severyn, Simon Müller, Mark Cieliebak, Thomas Hofmann, and Martin Jaggi. In “Proceedings of the 26th International World Wide Web Conference (WWW-2017)”, Perth, Australia, 2017.

Overview

We provide word embeddings for various languages. The following table gives an overview of the available embeddings.

Language    News    Tweets    Wikipedia
English             ✓         ✓
German              ✓         ✓
French              ✓
Italian             ✓         ✓
Spanish             ✓
Dutch

We trained our word embeddings on different text types, such as Tweets and Wikipedia articles. The text type influences how the embeddings perform on the NLP task at hand. For instance, for sentiment analysis, word embeddings trained on News or Tweets tend to achieve better results than those trained on Wikipedia. For a detailed analysis of how to select suitable word embeddings, see the following research article:
Potential and Limitations of Cross-Domain Sentiment Classification, by Dirk von Grünigen, Martin Weilenmann, Jan Deriu, and Mark Cieliebak (SocialNLP-2017).

We provide pre-trained word embeddings with different vector lengths (e.g. 52 and 200 dimensions). Typically, higher-dimensional embeddings capture more information and yield better quality, but they require considerably more memory and disk space.
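For example, storing a vocabulary of one million words as 32-bit floats takes about 208 MB at 52 dimensions (1,000,000 × 52 × 4 bytes), but roughly 800 MB at 200 dimensions.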

Instructions

The embeddings are stored either in a folder or as a standalone file. The folder structure consists of:

  • bigram: a phrase model that detects bi-grams (frequent two-word phrases) in a sentence
  • trigram: a phrase model that, given a sentence whose bi-grams have already been merged, detects tri-grams
  • config.json: lists the hyperparameters used to create the word embeddings
  • embedding_file: the main file, containing the corresponding vector for each word
  • embedding_matrix.npy: a numpy matrix encoding the embeddings; each row represents one word vector
  • vocabulary.pickle: an index that maps each word to a unique id; the id is the row of that word's vector in embedding_matrix.npy

In case the download consists of a single file only, that file is the same as the embedding_file described above.
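
As a minimal sketch of how the folder format can be loaded (this assumes that vocabulary.pickle stores a plain Python dictionary mapping each word to its row id, as described in the listing above):

    import pickle
    import numpy as np

    # Embedding matrix: one row per word vector.
    embeddings = np.load("embedding_matrix.npy")

    # Vocabulary: maps each word to the row id of its vector
    # (assumed here to be a plain dict, as described above).
    with open("vocabulary.pickle", "rb") as f:
        vocabulary = pickle.load(f)

    def vector_for(word):
        """Return the embedding vector for `word`, or None if unknown."""
        idx = vocabulary.get(word)
        return embeddings[idx] if idx is not None else None

    vec = vector_for("hello")
    if vec is not None:
        print(vec.shape)  # e.g. (52,) or (200,), depending on the download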

Unless mentioned otherwise, our word embeddings are trained with Word2Vec.
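
For reference, embeddings of this kind can be trained with the Word2Vec implementation in gensim; the sketch below uses illustrative hyperparameters only, while the actual values for each download are recorded in its config.json:

    from gensim.models import Word2Vec

    # Toy corpus; in practice this would be millions of tokenized
    # tweets or Wikipedia sentences.
    sentences = [
        ["word", "embeddings", "map", "words", "to", "vectors"],
        ["sentiment", "analysis", "uses", "word", "embeddings"],
    ]

    # gensim >= 4 (older versions use `size` instead of `vector_size`).
    model = Word2Vec(
        sentences,
        vector_size=52,  # 52 or 200 dimensions, as in the downloads
        window=5,
        min_count=1,
        workers=4,
    )

    print(model.wv["word"])              # vector for "word"
    print(model.wv.most_similar("word"))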

Word Embeddings from Tweets

English

Download Word Embeddings trained with Word2Vec on 200 million English Tweets using 200 dimensions.

Download Word Embeddings trained with Word2Vec on 590 million English Tweets using 52 dimensions.

German

Download Word Embeddings trained with Word2Vec on 200 million German Tweets using 200 dimensions.

Download Word Embeddings trained with Word2Vec on 300 million German Tweets using 52 dimensions.

French

Download Word Embeddings trained with Word2Vec on 300 million French Tweets using 52 dimensions.

Italian

Download Word Embeddings trained with Word2Vec on 200 million Italian Tweets using 200 dimensions.

Download Word Embeddings trained with Word2Vec on 300 million Italian Tweets using 52 dimensions.

Spanish

Download Word Embeddings trained with Word2Vec on 200 million Spanish Tweets using 200 dimensions.

Multilingual

Download Word Embeddings trained with Word2Vec on 300 million multilingual Tweets using 52 dimensions.

Download Word Embeddings trained with Word2Vec on 800 million multilingual Tweets using 200 dimensions.

Word Embeddings from Wikipedia Articles

English

Download Word Embeddings trained with Word2Vec on 4.5 million English Wikipedia articles using 200 dimensions.

German

Download Word Embeddings trained with Word2Vec on 2 million German Wikipedia articles using 200 dimensions.

Download Word Embeddings trained with Word2Vec on 2 million German Wikipedia articles using 52 dimensions.

Italian

Download Word Embeddings trained with Word2Vec on 1.3 million Italian Wikipedia articles using 200 dimensions.

Download Word Embeddings trained with Word2Vec on 1.3 million Italian Wikipedia articles using 52 dimensions.

Sentiment Corpus

We also offer a free sentiment-annotated corpus of German tweets, available for Download. The corpus contains the tweet IDs together with their sentiment labels.
Please refer to the README.md as well as to the Annotator_Instructions.pdf.
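
The file layout is documented in the README.md; as an illustration only, assuming a tab-separated file with one tweet id and one sentiment label per line, the corpus could be loaded like this:

    import pandas as pd

    # Hypothetical layout: tab-separated, columns tweet_id and label.
    # Check the README.md for the actual format.
    corpus = pd.read_csv("corpus.tsv", sep="\t", names=["tweet_id", "label"])

    print(corpus["label"].value_counts())  # label distribution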
