KNIME Extension for Apache Spark

KNIME Extension for Apache Spark is a set of nodes used to create and execute Apache Spark applications with the familiar KNIME Analytics Platform. Visual programming allows code-free big-data science, while scripting nodes allow detailed control when desired. This library of nodes enables you to:

  • Carry out data and tool blending
  • Import, export, and access data with Hive and HDFS using KNIME Analytics Platform and KNIME Server
  • Conduct predictive analytics and scoring on Apache Spark using PMML models developed with KNIME Analytics Platform (and deploy on KNIME Server)
  • Embed Java Apache Spark applications into KNIME workflows
  • Mix & match local and Hadoop executions within the same workflow

Apache Spark Node Categories

This library includes nodes to perform the following functions on Apache Spark:

  • I/O
  • Manipulation
  • Machine Learning
  • Statistics
  • Scoring


MLlib Integration

Integrate Apache Spark’s scalable machine learning library into your workflows to perform:

  • Classification (Decision Tree, Naïve Bayes, etc.)
  • Regression (Logistic Regression, Linear Regression, etc.)
  • Clustering (k-Means)
  • Collaborative Filtering (ALS)
  • Dimensionality Reduction (SVD, PCA)


Functionality

KNIME Extension for Apache Spark provides a variety of KNIME nodes that allow you to create and execute Apache Spark applications without any programming. The nodes offer easy-to-use data mining, scoring, statistics, data manipulation, and data import/export on Apache Spark from within KNIME Analytics Platform. Integration with Apache Spark MLlib enables complex statistics and powerful machine learning in Apache Spark directly from KNIME Analytics Platform (or KNIME Server), with a collection of the most popular algorithms for classification, regression, clustering, collaborative filtering, and dimensionality reduction.

Spark nodes overview

Usage example: K-Means clustering on Apache Spark with data from Apache Hive

The Hive to Spark node imports the results of a Hive query into an Apache Spark DataFrame, keeping the column schema information. An Apache Spark DataFrame is a dataset that is stored in a distributed fashion on your Hadoop cluster. In this example, the Spark Partitioning node first splits the DataFrame into training and test data. The training set flows into the Spark k-Means node, which trains a clustering model (using Apache Spark's MLlib) on the data and hands it to the Spark Cluster Assigner node. This node uses the model to label the previously unseen test data. Finally, the Spark to Hive node stores the labeled data back into a Hive table, and the Spark to Table node imports the labeled test data into KNIME Analytics Platform.
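To make the partition, train, and assign sequence concrete, here is a minimal sketch in plain Python on a handful of in-memory points. It is not KNIME or Spark code and does not reflect how the nodes are implemented: the function names (`partition`, `train_kmeans`, `assign_cluster`), the sample rows, and the parameter values are all hypothetical, and the training step is a simple version of Lloyd's k-means algorithm rather than MLlib's distributed one.

```python
import random

def partition(rows, train_fraction, seed):
    """Randomly split rows into (train, test), like a partitioning step."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_cluster(point, centers):
    """Return the index of the nearest center (the cluster-assignment step)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))

def train_kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm: alternate point assignment and center update."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct training points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[assign_cluster(p, centers)].append(p)
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                new_centers.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:  # assignments stopped changing
            break
        centers = new_centers
    return centers

# Hypothetical stand-in for the rows a Hive query would return.
rows = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
        (5.0, 5.1), (5.2, 4.9), (4.9, 5.3), (5.1, 5.0)]

train, test = partition(rows, train_fraction=0.75, seed=42)  # split the data
centers = train_kmeans(train, k=2)                           # train on the training set only
labeled_test = [(p, assign_cluster(p, centers)) for p in test]  # label unseen data
```

In the workflow itself, each of these steps is a node operating on a distributed DataFrame, and `labeled_test` corresponds to the labeled data that is written back to Hive or imported into KNIME Analytics Platform.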


Compatibility

KNIME Extension for Apache Spark supports the following Hadoop distributions:

  • Hortonworks HDP 2.2 - 2.6 with Spark 1.x and Spark 2.x (as shipped with HDP)
  • Hortonworks HDP 3.1 with Spark 2.3
  • Cloudera CDH 5.3 - 5.16 with Spark 1.x and Spark 2.x (as shipped with CDH)
  • Cloudera CDH 6.1 with Apache Spark 2.4

Please see our documentation for more details.

 

For KNIME Analytics Platform 3.7 and KNIME Server 4.8 and newer versions

Please consult our KNIME Big Data Extensions Admin Guide for further information and supplementary download links.


For KNIME Analytics Platform 3.6 and KNIME Server 4.7

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the cluster-side Spark Jobserver.  Please consult the installation guide for details.

 


For KNIME Analytics Platform 3.5 and KNIME Server 4.6

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the cluster-side Spark Jobserver.  Please consult the installation guide for details.

 

For KNIME Analytics Platform 3.4 and KNIME Server 4.5

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver.  Please follow the installation guide below:

For KNIME Analytics Platform 3.3 and KNIME Server 4.4

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver.  Please follow the installation guide below:

For KNIME Analytics Platform 3.2 and KNIME Server 4.3

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver.  Please follow the installation guide below:

For KNIME Analytics Platform 3.1 and KNIME Server 4.2

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver. Please follow the installation guide below:

For KNIME Analytics Platform 2.12 and KNIME Server 4.1

Note: This version of KNIME Extension for Apache Spark requires a license. Please contact sales@knime.com.

You need to install (i) a client-side extension for KNIME Analytics Platform and (ii) the server-side Spark Jobserver. Please follow the installation guide below:

 

 


All third-party trademarks (including logos and icons) referenced remain the property of their respective owners. Their use does not indicate any relationship, sponsorship, or endorsement between KNIME and the respective owners. Any references are made to identify the corresponding goods or services and shall be considered nominative fair use.
