deep learning / supervised and unsupervised large-scale learning / generalized linear models / support vector machines / decision trees & random forests / clustering / distributed computing / Spark & Hadoop / sparse regression / reinforcement learning / map-reduce / and many more.
Unstructured text documents such as product reviews, websites, or project reports contain a wealth of valuable information. Smart algorithms can unearth these treasures by automatically interpreting and “understanding” the content of huge datasets.
- Sentiment Analysis: distinguish positive and negative posts in social media
- Main Text Extraction: extract and summarize most important content in websites
- Named Entity Extraction: identify companies, organizations, persons in unstructured texts
- Machine Translation: translate text into arbitrary languages
Example: Sentiment Analysis
Sentiment Analysis decides for a text document whether its tone is positive, negative, mixed, or neutral. It is used, for instance, to evaluate customer feedback in social media, where huge amounts of tweets, Facebook posts, and blog entries are analyzed automatically.
We identified a set of approx. two million representative “features” that are indicative of the sentiment of a text. Sample features are: bag-of-words (which words are used), word embeddings (semantic representation of each word as a vector), n-grams (consecutive sequences of several words), part-of-speech tags (occurrences of nouns, verbs, or adjectives), usage of negation, sentiment information of single words and combinations, use of punctuation, etc. These features are fed into a combination of SVM classifiers and Random Forests. The system was trained on tens of thousands of sample documents that were manually labeled as positive, neutral, or negative, plus billions of additional unlabeled text documents.
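To illustrate the general idea, here is a toy sketch of a bag-of-words + n-gram sentiment classifier with a linear SVM. This is an illustrative stand-in with made-up training sentences, not the production system described above (which combines roughly two million features, SVMs, and Random Forests).

```python
# Toy sentiment classifier: unigram/bigram counts fed into a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "great product, I love it",
    "absolutely wonderful experience",
    "works fine, highly recommended",
    "terrible quality, very disappointed",
    "awful support, complete waste of money",
    "broke after one day, do not buy",
]
train_labels = ["positive", "positive", "positive",
                "negative", "negative", "negative"]

# Unigrams and bigrams as features (a tiny slice of the feature types
# listed above), classified by a linear SVM.
model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LinearSVC(),
)
model.fit(train_texts, train_labels)

print(model.predict(["I love this wonderful product"])[0])
```

A real system would add the remaining feature families (embeddings, part-of-speech tags, negation handling) and combine several classifiers.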
Our classifier achieves an accuracy of 68% on the target data. This is “state-of-the-art”, compared to other commercial tools, as we have shown in a scientific evaluation (cf. http://ceur-ws.org/Vol-1096/paper4.pdf).
Our technology proved its quality in the official SemEval competition in both 2014 and 2015. SemEval is the annual international competition for semantic text analysis. In both years, our solution achieved rank 8 in the final ranking and was the only tool among the top ten in both years (cf. http://alt.qcri.org/semeval2015/task10 and the published papers on our systems: PDF 2014, PDF 2015).
Numerical Data Analysis
Data is everywhere, and the amount of data increases every day. We handle “everything” that comes as numbers and tables: sales reports, monthly revenues, customer feedback forms, stock exchange rates, sensor data, etc.
- Trend prediction and forecasting: will the numbers go up or down in the future?
- Anomaly detection: identify outliers or radical changes
- Clustering: find patterns of similar entities in huge datasets
- Data Curation: cleanup data for further processing
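As a minimal illustration of the anomaly-detection item above, here is a simple z-score outlier detector on a one-dimensional series. The threshold and revenue figures are made up for the example; production anomaly detection would use more robust, model-based methods.

```python
# Minimal z-score outlier detection on a 1-D series (illustrative only).
from statistics import mean, stdev

def find_outliers(values, threshold=2.5):
    """Return indices whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = stdev(values)
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Monthly revenue figures with one radical change at index 5.
revenues = [100, 102, 98, 101, 99, 500, 100, 97, 103, 101]
print(find_outliers(revenues))  # → [5]
```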
Audio Analysis
Audio signals can come from TV or radio broadcasts, conference presentations, or YouTube videos. The best-known application is speech recognition, where the audio signal is transcribed into written text.
- Speech Recognition: transcribe spoken words into written text
- Speaker Recognition: identify the current speaker
- Background Noise Removal: eliminate unwanted signals
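The noise-removal item can be sketched in its crudest form as an amplitude “noise gate” that silences samples below a threshold. Real background-noise removal works in the frequency domain (e.g. spectral subtraction or learned filters); the sample values and threshold here are invented for illustration.

```python
# Naive amplitude noise gate: zero out samples whose magnitude falls
# below a threshold. A toy stand-in for real noise-removal methods.
def noise_gate(samples, threshold):
    return [s if abs(s) >= threshold else 0 for s in samples]

# A quiet hum (small values) with a louder speech burst in the middle.
signal = [3, -2, 4, 120, -90, 150, -2, 1, -3]
print(noise_gate(signal, threshold=10))
# → [0, 0, 0, 120, -90, 150, 0, 0, 0]
```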
Image and Video Analysis
Images are everywhere, and interpreting their content automatically can yield interesting insights for business processes. One question we were asked recently: How often does my logo appear in TV shows?
- Face Recognition: identify a person in a picture
- Sentiment Analysis: is the emotion of an image positive or negative?
- Logo Detection: find appearances of company logos in images and videos
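The core idea behind logo detection can be shown with exhaustive template matching on tiny grayscale grids: slide the logo template over the image and return the offset with the lowest sum of absolute differences (SAD). This is only a conceptual sketch with invented pixel values; real logo detection in video uses learned detectors such as convolutional networks.

```python
# Toy logo detection via template matching (sum of absolute differences).
def best_match(image, template):
    """Return the (row, col) offset where `template` best matches `image`."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best = None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            sad = sum(abs(image[y + dy][x + dx] - template[dy][dx])
                      for dy in range(th) for dx in range(tw))
            if best is None or sad < best[0]:
                best = (sad, (y, x))
    return best[1]

logo = [[9, 9],
        [9, 0]]
frame = [[0, 0, 0, 0],
         [0, 9, 9, 0],
         [0, 9, 0, 0],
         [0, 0, 0, 0]]
print(best_match(frame, logo))  # → (1, 1)
```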