Collection Cookbook

Tue, 03/17/2015 - 09:45 tobias.koetter

Today we would like to show you how to work with collection cells in KNIME. You may already have come across these cells, which hold a collection of other cells, for example a collection of strings representing a frequent itemset or the items in a transaction.

The workflow associated with this post is available for download in the attachment section of this post or in the EXAMPLES server under 003_Preprocessing/003004_CollectionCookbook_blog.

 

Collection Types

KNIME supports two different types of collection cells that differ in the handling of duplicates and the order of elements.

  • List: A list cell corresponds to the mathematical sequence in that it contains all elements (including duplicates) in the order they have been added to the collection cell.
  • Set: A set cell corresponds to the mathematical set in that it contains each element only once in an arbitrary order.

Example

Let us add the following elements to a collection cell in this order: 1, 2, 2, 3, 3, 4, 5. A list cell would contain all elements in the order in which they were added, i.e. (1, 2, 2, 3, 3, 4, 5), whereas a set cell would contain only the unique elements in an arbitrary order, e.g. {3, 1, 2, 4, 5}.
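The difference can be sketched outside KNIME with Python's built-in list and set types. This is only an analogy for the two cell types, not KNIME code, and the variable names are made up for illustration.

```python
# Elements in the order they are added to the collection
elements = [1, 2, 2, 3, 3, 4, 5]

# List semantics: keep every element, including duplicates, in insertion order
list_cell = list(elements)

# Set semantics: keep each element only once; no particular order is guaranteed
set_cell = set(elements)

print(list_cell)      # [1, 2, 2, 3, 3, 4, 5]
print(len(set_cell))  # 5
```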

The Collection Types section of the workflow demonstrates the creation and behaviour of the two collection types.

 

Collection creation and conversion

Collection cells are either created manually by the user or are the result of a KNIME node such as the Item Set Finder. To manually create a collection cell you can combine either the cells of several columns or the cells of several rows into a single collection cell.

Row wise

You can use the GroupBy node to combine the cells of one column of several rows into a single collection cell. The number of rows per group, and therefore the number of elements in each collection cell, depends on the selected group columns. If you do not select a group column, all rows are added to a single collection cell. In the aggregation section you select the column you want to use and the collection type you want to create. You can choose to create a list cell that contains all elements in the order they appear in the corresponding group or have them sorted based on their value. You can also create a set cell that contains only the unique values of the selected aggregation column.
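To make the three row-wise aggregation options concrete, here is a small conceptual sketch in plain Python. It does not use any KNIME API; the table contents and the helper name group_rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction table: (customer, item) rows
rows = [
    ("c1", "milk"),
    ("c2", "bread"),
    ("c1", "bread"),
    ("c1", "milk"),
    ("c2", "apple"),
]

def group_rows(rows):
    """Collect the values of the second column per value of the group column."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return dict(groups)

# List: all elements in the order they appear within the group
as_list = group_rows(rows)

# Sorted list: all elements, sorted by value
as_sorted_list = {key: sorted(values) for key, values in as_list.items()}

# Set: only the unique values of the group
as_set = {key: set(values) for key, values in as_list.items()}

print(as_list["c1"])         # ['milk', 'bread', 'milk']
print(as_sorted_list["c1"])  # ['bread', 'milk', 'milk']
```

Grouping without a group column corresponds to putting every row under one and the same key, which yields a single collection for the whole table.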

The node that converts a collection cell back into individual rows is the Ungroup node. For each collection element it creates a new data row. If the collection cell is a list cell, the row order from top to bottom reflects the order of the collection elements from left to right. If you ungroup two collection cells with different numbers of elements simultaneously, the rows for which the shorter collection has no element are filled with missing values.
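The padding behaviour when ungrouping collections of different lengths can be pictured with the following plain-Python sketch. It is an analogy only: None stands in for a KNIME missing value, and the function name ungroup is hypothetical.

```python
from itertools import zip_longest

def ungroup(*collections):
    """Expand parallel collections into rows, padding shorter ones with None."""
    return list(zip_longest(*collections, fillvalue=None))

items = ["milk", "bread", "apple"]  # three elements
prices = [0.99, 1.49]               # only two elements

rows = ungroup(items, prices)
for row in rows:
    print(row)
# ('milk', 0.99)
# ('bread', 1.49)
# ('apple', None)
```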

Notice that by using the GroupBy node and the Ungroup node you can collapse a KNIME table into a single row and expand it again without losing any information. Simply use the GroupBy node with List as the aggregation method for all columns to aggregate all rows into a single row. Later on you can use the Ungroup node on all collection columns to expand the row into a table.
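The round trip can be illustrated with a short plain-Python sketch, assuming a small hypothetical table in which every row has a value in every column. Collapsing creates one list per column, and expanding turns the lists back into rows.

```python
# Hypothetical table with two columns and three rows
table = [(1, "a"), (2, "b"), (3, "c")]

# Collapse: aggregate every column into one list, giving a single "row" of lists
collapsed = tuple(list(column) for column in zip(*table))
print(collapsed)  # ([1, 2, 3], ['a', 'b', 'c'])

# Expand: ungroup all list columns at once to get the original rows back
expanded = list(zip(*collapsed))
print(expanded == table)  # True
```

The round trip only works because lists keep duplicates and order; with sets, repeated values and the row order would be lost.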

Column wise

To combine several columns of a single row into a new collection cell you can use either the Column Aggregator or the Create Collection Column node.

The Create Collection Column node allows you to create a collection cell from a set of columns that are selected either manually or based on their name or type. Depending on the node setting ("Create a collection type 'set'") the node creates either a list cell, which contains the elements in the order they appear in the selected columns from left to right, or a set cell, which contains only the unique elements of the columns.

Like the GroupBy node, the Column Aggregator node offers, in addition to many other aggregation methods, the option to create either a list cell, which contains all elements in the order of the selected columns, or a list cell in which the elements are sorted by value. You can also create a set cell, which contains only the unique values of the selected columns.
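Column-wise creation can be pictured in the same way. The following plain-Python sketch is not KNIME code; the row contents and column names are hypothetical.

```python
# Hypothetical row: column name -> cell value
row = {"item1": "milk", "item2": "bread", "item3": "milk", "customer": "c1"}

# Columns selected for the collection, from left to right
selected = ["item1", "item2", "item3"]

# List: elements in the order of the selected columns
list_cell = [row[name] for name in selected]

# Sorted list: the same elements, sorted by value
sorted_list_cell = sorted(list_cell)

# Set: only the unique elements of the selected columns
set_cell = set(list_cell)

print(list_cell)         # ['milk', 'bread', 'milk']
print(sorted_list_cell)  # ['bread', 'milk', 'milk']
```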

The node that converts a collection cell back into individual columns is the Split Collection Column node. It splits the single elements up into columns. If the collection cell is a list cell, the order of the elements is maintained when creating the columns.
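As a rough plain-Python analogy, splitting a list into columns amounts to numbering its elements in order. The function name and the generated column names (element_1, element_2, ...) are hypothetical and are not the names KNIME assigns.

```python
def split_collection(cell, prefix="element_"):
    """Spread the elements of a list over numbered columns, keeping their order."""
    return {f"{prefix}{index}": value for index, value in enumerate(cell, start=1)}

columns = split_collection(["milk", "bread", "apple"])
print(columns)
# {'element_1': 'milk', 'element_2': 'bread', 'element_3': 'apple'}
```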

The Collection creation and conversion section of the workflow demonstrates the row-wise and column-wise creation and conversion of collection cells.

 

Working with collections

KNIME provides several nodes that work with collection cells. For example, the Column Aggregator node and the GroupBy node provide aggregation methods not only to create collection cells but also to perform set operations on them, e.g. union, intersection, exclusive-or and element counting.
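For readers who want a quick reminder of what these set operations compute, here is a minimal sketch using Python's built-in sets. The two example collections are made up and the code is independent of KNIME.

```python
a = {"milk", "bread", "apple"}
b = {"bread", "apple", "beer"}

union = a | b          # elements contained in at least one collection
intersection = a & b   # elements contained in both collections
exclusive_or = a ^ b   # elements contained in exactly one collection
element_count = len(a) # number of elements in a collection

print(sorted(union))         # ['apple', 'beer', 'bread', 'milk']
print(sorted(intersection))  # ['apple', 'bread']
print(sorted(exclusive_or))  # ['beer', 'milk']
print(element_count)         # 3
```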

The Create Bit Vector node allows you to create a bit vector not only from multiple columns but also from a single collection column, e.g. the item list of a shopping cart. Each unique element is assigned a position in the bit vector, so the resulting bit vectors have a length equal to the number of unique elements across all collection cells.
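The idea can be sketched in plain Python as follows. This is a conceptual illustration, not the node's implementation: the function name is hypothetical, and assigning positions in order of first appearance is an assumption made here for simplicity, since the text does not say how the node orders the positions.

```python
def to_bit_vectors(collections):
    """Encode each collection as a bit vector over all unique elements."""
    # Assumption: positions are assigned in order of first appearance
    positions = {}
    for cell in collections:
        for element in cell:
            positions.setdefault(element, len(positions))

    vectors = []
    for cell in collections:
        bits = [0] * len(positions)  # one bit per unique element overall
        for element in cell:
            bits[positions[element]] = 1
        vectors.append(bits)
    return positions, vectors

# Hypothetical shopping carts
carts = [["milk", "bread"], ["bread", "beer"], ["milk"]]
positions, vectors = to_bit_vectors(carts)

print(positions)  # {'milk': 0, 'bread': 1, 'beer': 2}
print(vectors)    # [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
```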

The Item Set Finder (Borgelt) node provides several algorithms to search for frequently co-occurring items in a given collection column. The results of the node, the discovered frequent itemsets, are each represented as a set cell that contains the frequently co-occurring items. The Subset Matcher node allows you to check whether a given subset, such as a discovered frequent itemset, is contained in a given collection cell, e.g. an item list. For example, you can use the node to find all transactions that contain a specific frequent itemset.
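The subset check itself is simple to picture. The sketch below shows it in plain Python with made-up transactions; the function name is hypothetical and the code says nothing about how the Subset Matcher node is implemented.

```python
def matching_transactions(itemset, transactions):
    """Return the indices of all transactions that contain every item of itemset."""
    wanted = set(itemset)
    return [
        index
        for index, transaction in enumerate(transactions)
        if wanted <= set(transaction)  # subset test
    ]

# Hypothetical item lists, one per transaction
transactions = [
    ["milk", "bread", "beer"],
    ["bread"],
    ["beer", "milk"],
]

matches = matching_transactions({"milk", "beer"}, transactions)
print(matches)  # [0, 2]
```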

The Working with collections section of the workflow contains a subset of the nodes that support collection cells.

 

Requirements

- KNIME Analytics Platform 2.11
- KNIME Itemset Mining Extension
