KNIME Extension for Apache Spark

KNIME Big Data Extensions are now open source and included in KNIME Analytics Platform.

 

KNIME Extension for Apache Spark is a set of nodes used to create and execute Apache Spark applications with the familiar KNIME Analytics Platform. Visual programming allows code-free big-data science, while scripting nodes allow detailed control when desired. This library of nodes enables you to:

  • Carry out data and tool blending
  • Import, export, and access data with Hive, HDFS, KNIME Analytics Platform, or KNIME Server
  • Conduct predictive analytics and scoring on Apache Spark using PMML models built with KNIME workflows
  • Embed Java Apache Spark applications into KNIME workflows
  • Mix & match local and Hadoop executions within the same workflow

Apache Spark Node Categories

This library includes nodes to perform the following functions on Apache Spark:

  • I/O
  • Manipulation
  • Dimensionality Reduction
  • Machine Learning (Extensive List)
  • Statistics
  • Scoring


MLlib Integration

Integrate Apache Spark’s scalable machine learning library into your workflows to perform:

  • Classification
  • Regression
  • Clustering
  • Collaborative filtering
  • Dimensionality reduction


Functionality

KNIME Extension for Apache Spark provides a variety of new KNIME nodes that allow you to create and execute Apache Spark applications without any programming. The new nodes offer seamless, easy-to-use data mining, scoring, statistics, data manipulation, and data import/export on Apache Spark from within KNIME Analytics Platform.

  • Integration with Apache Spark MLlib brings complex statistics and powerful machine learning in Apache Spark directly to KNIME Analytics Platform (or KNIME Server), with a collection of the most popular algorithms for:
    • Classification (decision tree, naïve Bayes, etc.)
    • Regression (logistic regression, linear regression, etc.)
    • Clustering (k-means)
    • Collaborative filtering (ALS)
    • Dimensionality reduction (SVD, PCA)
  • Use PMML models built with KNIME Analytics Platform (or KNIME Server) for prediction in Apache Spark
  • Pre-process and manipulate data in Apache Spark
  • Import, export, and access data in Hive, HDFS, and KNIME Analytics Platform (or KNIME Server) from within Apache Spark
  • Embed existing Java Apache Spark applications into your KNIME workflow

Usage example: K-Means clustering on Apache Spark with data from Apache Hive

The Hive to Spark node imports the results of a Hive query into an Apache Spark DataFrame, keeping the column schema information. An Apache Spark DataFrame is a dataset that is stored in a distributed fashion on your Hadoop cluster. In this example, the Spark Partitioning node first splits the DataFrame into training and test data. The training set flows into the Spark k-Means node, which trains a clustering model on the data (using Apache Spark's MLlib) and hands it to the Spark Cluster Assigner node. This node uses the model to label the previously unseen test data. Finally, the Spark to Hive node stores the labeled data back into a Hive table, and the Spark to Table node imports the labeled test data into KNIME Analytics Platform.
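For readers who prefer to see the logic spelled out, the steps of this workflow (partition, train k-means, assign clusters) can be sketched in plain Python. This is only an illustration of the data flow under stated assumptions: it uses neither Spark nor any KNIME API, the function names (partition, train_kmeans, assign_clusters) are hypothetical, and the small in-memory list of rows is a made-up stand-in for the result of a Hive query.

```python
import random


def partition(rows, train_fraction, seed):
    """Randomly split rows into training and test sets (the partitioning step)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def train_kmeans(rows, k, iterations=20, seed=0):
    """Basic Lloyd-style k-means; returns the list of cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)  # pick k distinct rows as initial centers
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for row in rows:
            groups[nearest(row, centers)].append(row)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [
            tuple(sum(col) / len(group) for col in zip(*group)) if group
            else centers[i]
            for i, group in enumerate(groups)
        ]
    return centers


def assign_clusters(rows, centers):
    """Label each row with the index of its nearest center (the assigner step)."""
    return [(row, nearest(row, centers)) for row in rows]


# Hypothetical stand-in for the rows returned by a Hive query.
rows = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]

train, test = partition(rows, train_fraction=0.75, seed=42)
centers = train_kmeans(train, k=2)
labeled = assign_clusters(test, centers)

for row, cluster in labeled:
    print(row, "-> cluster", cluster)
```

In the KNIME workflow the same steps run on the cluster against a distributed DataFrame, and the labeled rows are written back to Hive rather than printed.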

See KNIME Extension for Apache Spark in action!

We highly recommend watching this video to get a feel for what you can do with KNIME Extension for Apache Spark.

 

Compatibility

KNIME Extension for Apache Spark supports the following Hadoop distributions:

  • Hortonworks HDP 2.2 with Apache Spark 1.2
  • Hortonworks HDP 2.3.0 with Apache Spark 1.3
  • Hortonworks HDP 2.3.4 with Apache Spark 1.5
  • Hortonworks HDP 2.4.x with Apache Spark 1.6
  • Hortonworks HDP 2.5.x with Apache Spark 1.6 and 2.0
  • Hortonworks HDP 2.6.x with Apache Spark 1.6 and 2.1
  • Hortonworks HDP 2.6.3 with Apache Spark 1.6 and 2.2
  • Cloudera CDH 5.3 with Apache Spark 1.2
  • Cloudera CDH 5.4 with Apache Spark 1.3
  • Cloudera CDH 5.5 with Apache Spark 1.5
  • Cloudera CDH 5.6 with Apache Spark 1.5
  • Cloudera CDH 5.7 with Apache Spark 1.6, 2.0 and 2.1
  • Cloudera CDH 5.8 with Apache Spark 1.6, 2.0 and 2.1
  • Cloudera CDH 5.9 with Apache Spark 1.6, 2.0 and 2.1
  • Cloudera CDH 5.10 with Apache Spark 1.6, 2.0 and 2.1
  • Cloudera CDH 5.11 with Apache Spark 1.6, 2.0 and 2.1
  • Cloudera CDH 5.12 with Apache Spark 1.6, 2.0, 2.1 and 2.2
  • Cloudera CDH 5.13 with Apache Spark 1.6, 2.0, 2.1 and 2.2

Installation steps

For KNIME Analytics Platform 3.5 and KNIME Server 4.6

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the cluster-side Spark Jobserver. Please consult the installation guide for details.

 

For KNIME Analytics Platform 3.4 and KNIME Server 4.5

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver.  Please follow the installation guide below:

For KNIME Analytics Platform 3.3 and KNIME Server 4.4

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver.  Please follow the installation guide below:

For KNIME Analytics Platform 3.2 and KNIME Server 4.3

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver.  Please follow the installation guide below:

For KNIME Analytics Platform 3.1 and KNIME Server 4.2

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver. Please follow the installation guide below:

For KNIME Analytics Platform 2.12 and KNIME Server 4.1

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver. Please follow the installation guide below:

 


All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. Their use does not indicate any relationship, sponsorship, or endorsement between KNIME and the respective owners. Any reference made is to identify the corresponding goods or services and shall be considered nominative fair use.