Document clustering

This workflow shows how to import textual data, preprocess documents by filtering and stemming, transform documents into a bag of words and document vectors, and finally cluster the documents based on their numerical representation.
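
The bag-of-words and document-vector steps described above can be sketched in plain Python. This is a minimal illustration, not the workflow's actual nodes: the stop-word list, the function names, and the sample documents are all assumptions made for the example, and stemming is omitted for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a real workflow would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "on", "in", "to"}

def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def document_vectors(texts):
    """Return the sorted vocabulary and one term-frequency vector per text."""
    bags = [Counter(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*bags))
    vectors = [[bag[term] for term in vocabulary] for bag in bags]
    return vocabulary, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up sample documents for illustration only.
docs = [
    "The cat sat on the mat",
    "A cat and a dog",
    "Stock markets fell sharply",
]
vocabulary, vectors = document_vectors(docs)
```

A clustering algorithm would then group the documents using a distance or similarity measure over these vectors, such as the cosine similarity shown here.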

Document Classification

This workflow shows how to import textual data, preprocess documents by filtering and stemming, transform documents into a bag of words and document vectors, and finally build a predictive model to classify the documents. It also contains the corresponding deployment workflow.
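
To make the "build a predictive model" step concrete, here is a small sketch of one possible classifier over bag-of-words counts. The description does not say which learner the workflow uses, so multinomial naive Bayes with add-one smoothing is an assumption chosen for brevity, as are the function names and the toy training set.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on non-letters."""
    return re.findall(r"[a-z]+", text.lower())

def train(labelled_docs):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocabulary

def predict(model, text):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data for illustration only.
training = [
    ("the match ended with a late goal", "sport"),
    ("the team won the cup final", "sport"),
    ("parliament passed the budget vote", "politics"),
    ("the minister announced an election", "politics"),
]
model = train(training)
```

Calling `predict(model, "a goal in the final")` assigns the text to the class whose word counts best explain it. A deployment workflow would reuse the trained counts without retraining.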

Sentiment Classification with NGrams

This workflow shows how to import text from a CSV file, convert it to documents, preprocess the documents, and transform them into numerical document vectors consisting of single-word and 2-gram features.
Finally, two predictive models are trained on the vectors to predict the sentiment class of the documents. The two models are then compared using a ROC curve.
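
As an illustration of what single-word and 2-gram features look like, the sketch below counts both for one short text. The function name `ngram_features` and the tokenization rule are assumptions for this example, not the workflow's own implementation.

```python
import re
from collections import Counter

def ngram_features(text, max_n=2):
    """Count all n-grams of length 1..max_n over the lowercased tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

features = ngram_features("not good, not bad")
```

The 2-grams "not good" and "not bad" keep the negation attached to the word it modifies, which single-word features alone would lose.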

epub JPEG Romeo Juliet

The challenge here is to blend text and image data. The text is in epub format, while the images are in JPEG format. The goal is to build the network of interactions in one of Shakespeare's most famous tragedies: Romeo and Juliet. The network of interactions is then displayed as a graph in which each node represents a character and shows that character's JPEG image. epub with JPEG. Will they blend? ... and yes! They blend.
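
One simple way to derive such a network is to count how often two characters appear together. The sketch below assumes that "interaction" means sharing a scene, which the description does not specify; the scene groupings are invented toy input, not taken from the play's text.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: the set of characters present in each scene.
scenes = [
    {"Romeo", "Benvolio"},
    {"Romeo", "Juliet", "Nurse"},
    {"Juliet", "Nurse"},
]

def interaction_edges(scenes):
    """Return edge weights: how many scenes each pair of characters shares."""
    edges = Counter()
    for characters in scenes:
        # Sorting makes each pair a canonical (a, b) key with a < b.
        for pair in combinations(sorted(characters), 2):
            edges[pair] += 1
    return edges

edges = interaction_edges(scenes)
```

Each key in `edges` is an undirected edge between two character nodes, and its count can be used as the edge weight when drawing the graph.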

Topic Detection LDA

This workflow extracts topics from the "Romeo & Juliet" epub book using the Topic Extractor (Parallel LDA) node. It reads textual data from a table and converts it into documents. The documents are then preprocessed, i.e. tagged, filtered, lemmatized, etc. The Topic Extractor node is then applied to the preprocessed documents; it requires the number of topics to extract to be specified in advance. Finally, a tag cloud is created to visualize the topics' terms.

Sentiment Analysis Lexicon Based Approach

This workflow shows how to apply a lexicon-based approach to sentiment analysis of the IMDB reviews dataset. The dataset contains movie reviews that were previously labelled as positive or negative. The lexicon-based approach assigns a sentiment to each word in a text based on dictionaries of positive and negative words. A sentiment score is then calculated for each document as: (number of positive words - number of negative words) / total number of words.
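
The score formula above can be written directly in code. In this minimal sketch the two word sets are tiny hypothetical stand-ins for real sentiment dictionaries, and returning 0.0 for an empty text is an assumption made to avoid division by zero.

```python
import re

# Hypothetical miniature lexicons; real dictionaries are far larger.
POSITIVE = {"good", "great", "excellent", "enjoyable"}
NEGATIVE = {"bad", "boring", "awful", "dull"}

def sentiment_score(text):
    """(positive words - negative words) / total words, or 0.0 if empty."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    positives = sum(t in POSITIVE for t in tokens)
    negatives = sum(t in NEGATIVE for t in tokens)
    return (positives - negatives) / len(tokens)
```

For example, "Great film, excellent cast" has four words, two of them positive and none negative, so its score is 2 / 4 = 0.5. A score above zero suggests a positive review and a score below zero a negative one.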

RSS Feed Reader

This workflow downloads the most recent New York Times news feeds, extracts the titles and text, recognizes named entities in the news, and visualizes these named entities as a tag cloud.
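
The title and text extraction step can be illustrated with Python's standard XML parser. This sketch parses an inline, made-up RSS 2.0 snippet rather than downloading a live feed; the sample content and the function name `extract_items` are assumptions, and named-entity recognition is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed in RSS 2.0 layout, used instead of a live download.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>First headline</title>
      <description>Text of the first item.</description>
    </item>
    <item>
      <title>Second headline</title>
      <description>Text of the second item.</description>
    </item>
  </channel>
</rss>"""

def extract_items(feed_xml):
    """Return a (title, description) pair for every <item> in the feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("description", default=""))
        for item in root.iter("item")
    ]

items = extract_items(SAMPLE_FEED)
```

The resulting list of title and text pairs is the input on which entity recognition and the tag cloud would then operate.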
