This workflow builds a line plot of the age distribution of men and women in Maine (US) over the last 5 years. In particular, the women's data is processed via Hive SQL and the men's data via Spark SQL. Will they blend? The whole data set is initially read from a Hadoop Hive installation. And yes, Spark SQL and Hive SQL do blend!
This workflow mixes standard KNIME nodes with Spark nodes to find the optimal parameters for a k-means clustering using a hill-climbing approach.
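The hill-climbing idea can be sketched outside of Spark. The toy below is illustrative only, not the workflow's actual nodes: it uses a tiny 1-D k-means and a hypothetical penalised cost (within-cluster sum of squares plus a per-cluster penalty) as the objective, then greedily moves to a neighbouring k while the cost improves.

```python
def kmeans_1d(points, k, iters=20):
    """Tiny 1-D k-means (Lloyd's algorithm); a stand-in for the Spark k-means node."""
    # Deterministic init: k evenly spaced values from the sorted data.
    spts = sorted(points)
    centers = [spts[(i * (len(spts) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Keep the old center if a cluster went empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    wcss = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, wcss

def hill_climb_k(points, k_start=2, k_max=10, penalty=1.0):
    """Greedy hill climbing over k: step to a neighbouring k while the
    penalised cost (WCSS + penalty * k) keeps improving."""
    def cost(k):
        return kmeans_1d(points, k)[1] + penalty * k
    k = k_start
    while True:
        neighbours = [n for n in (k - 1, k + 1) if 1 <= n <= k_max]
        best = min(neighbours, key=cost)
        if cost(best) < cost(k):
            k = best
        else:
            return k

# Three well-separated groups: hill climbing settles on k = 3.
points = [0, 0.1, 0.2, 5, 5.1, 5.2, 10, 10.1, 10.2]
print(hill_climb_k(points))
```

In the actual workflow the inner loop is a Spark k-means node and the outer loop is built from standard KNIME loop nodes; the penalised-WCSS objective here is just one plausible scoring choice.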
1. Create a local Spark context; 2. Read ratings.csv and movies.csv from the MovieLens dataset into Spark (https://grouplens.org/datasets/movielens/); 3. Ask the user to rate 20 random movies to build a user profile and include it in the training set; 4. Train the Spark Collaborative Filtering Learner (Alternating Least Squares) algorithm (https://www.infofarm.be/articles/alternating-least-squares-algorithm-re…); 5.
This workflow demonstrates the usage of the Spark MLlib Decision Tree Learner and Spark Predictor. It also demonstrates the conversion of categorical columns into numerical columns which is necessary since the MLlib algorithms only support numerical features and labels.
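The categorical-to-numerical step can be pictured as a simple dictionary mapping. The helper below is a plain-Python sketch, not the KNIME node itself: it assigns each distinct category an integer index in order of first appearance and returns both the converted rows and the mapping (which a downstream step would need to translate predictions back).

```python
def category_to_number(rows, column):
    """Map each distinct categorical value in `column` to an integer index,
    mimicking the preprocessing MLlib needs before training."""
    mapping = {}
    out = []
    for row in rows:
        value = row[column]
        if value not in mapping:
            mapping[value] = len(mapping)  # next free index
        new_row = dict(row)  # leave the input rows untouched
        new_row[column] = mapping[value]
        out.append(new_row)
    return out, mapping

rows = [{"color": "red"}, {"color": "blue"}, {"color": "red"}]
converted, mapping = category_to_number(rows, "color")
print(converted, mapping)
```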
This workflow demonstrates the usage of the Hive to Spark and Spark to Hive nodes that allow you to transfer data between Apache Spark and Apache Hive.
This workflow demonstrates the usage of the Spark Compiled Model Predictor node, which converts a given PMML model into machine code and uses the compiled model to score large amounts of data in parallel within Apache Spark.
This workflow uses a portion of the Irish Energy Meter dataset, and presents a simple analysis based on the whitepaper "Big Data, Smart Energy, and Predictive Analytics". It is intended to highlight KNIME's Big Data and Spark functionality in the 3.6 release. The workflow creates a Local Big Data Environment, loads the meter dataset to Hive, and then transfers it into Spark. It uses a series of Spark SQL nodes to create datetime fields, and then uses Spark nodes to aggregate energy usage over these datetime fields.
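The derive-then-aggregate pattern from the last two steps can be sketched in plain Python. The function below is a stand-in for the Spark SQL and Spark GroupBy nodes, under the assumption that each meter reading is a (timestamp string, kWh) pair: it derives an hour-of-day field from the timestamp and sums energy usage per hour.

```python
from collections import defaultdict
from datetime import datetime

def usage_by_hour(readings):
    """Derive an hour-of-day field from each reading's timestamp and
    aggregate energy usage over it."""
    totals = defaultdict(float)
    for ts, kwh in readings:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").hour
        totals[hour] += kwh
    return dict(totals)

readings = [
    ("2013-01-01 08:30:00", 1.5),
    ("2013-01-01 08:45:00", 0.5),
    ("2013-01-01 09:00:00", 2.0),
]
print(usage_by_hour(readings))  # → {8: 2.0, 9: 2.0}
```

In the workflow the same derivation is expressed as Spark SQL (e.g. extracting hour/day/month columns) so that the aggregation runs distributed over the full meter dataset.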
This workflow demonstrates the usage of the Spark MLlib to PMML node. Together with the Compiled Model Predictor and the JSON Input/Output nodes, it can be used to model a so-called lambda architecture, which learns a machine learning model at scale on historical data offline and predicts events online using the learned model.
In this workflow we demonstrate how to use the KNIME Spark nodes to make locality recommendations. For this we use the Yelp reviews as provided by the Kaggle challenge (https://www.kaggle.com/yelp-dataset/yelp-dataset). The goal is to recommend good next localities (e.g., restaurants) to users who have so far given only one review.
This workflow demonstrates the usage of the different Spark Java Snippet nodes to read a text file from HDFS, parse it, filter it, and write the result back to HDFS.
You might also want to have a look at the snippet templates that each of these nodes provides. To do so, simply open the configuration dialog of a Spark Java Snippet node and go to the Templates tab.