Distance Measure (Developers Guide)

KNIME v2.10 offers a new [[Distance Measure]] framework. Distance measures are used in various nodes, most prominently in nodes for cluster analysis (like k-medoids or hierarchical clustering) but also in utility nodes like the similarity search node. One of the key design goals was to separate the definition of a distance measure from its use in the various nodes. This allows new measures to be defined without modifying the nodes that use them.

This article describes the framework interfaces, how to use a distance measure in code, and everything needed to plug in your own distance measures.

The main contributions are:

  • Parametrized distance measures.
  • Computation only on a definable set of columns.
  • Distance measures are treated as port objects and can therefore be used and shared across several nodes.
  • A couple of new distance measures, like the Mahalanobis distance or the Java Distance, which allows the definition of a custom distance computation directly within a KNIME workflow.
  • Extensible using the Eclipse [[Extension Point]] mechanism.
  • Framework helper classes that simplify extension: they categorize distances, create KNIME nodes for distance configuration, and plug in documentation for custom distances.

For a description of the new nodes and more detailed information about distance measures in general, see [[Distance Measure]].

Usage in Nodes

A distance measure is technically a KNIME port object and can be retrieved from the input object array like any other port object. Some distance measures need access to flow variables, so make sure that the NodeModel implements the org.knime.base.util.flowvariable.FlowVariableProvider interface. In the NodeModel#configure method you should validate the input table spec.

    ...
    @Override
    protected DataTableSpec[] configure(final PortObjectSpec[] inSpecs) throws InvalidSettingsException {
        DataTableSpec spec = (DataTableSpec)inSpecs[0];
        ((DistanceMeasurePortSpec)inSpecs[1]).validate(spec);
        //...
    }
    ...

Using the distance measure is then straightforward.

    ...
    @Override
    protected PortObject[] execute(final PortObject[] inData, final ExecutionContext exec) throws Exception {
        final BufferedDataTable in0 = (BufferedDataTable)inData[0];
        DistanceMeasure<?> distanceMeasure =
            ((DistanceMeasurePortObject)inData[1]).createDistanceMeasure(in0.getDataTableSpec(), this);
        //... within some loop
        DataRow a;
        DataRow b;
        try {
            double distance = distanceMeasure.computeDistance(a, b);
        } catch (DistanceMeasurementException e) {
            // thrown if the data contains values which are not supported by the distance measure,
            // for instance missing values.
        }
    }
    ...
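The loop hinted at in the comment could, for instance, fill a symmetric distance matrix. The following is a minimal, self-contained sketch and not KNIME code: RowDistance, DistanceException and PairwiseDistanceSketch are hypothetical stand-ins for DistanceMeasure, DistanceMeasurementException and the loop inside execute, and rows are plain double arrays in which NaN marks a missing value. Storing NaN for pairs that cannot be compared is only one possible policy; a node may just as well abort or skip the row.

```java
import java.util.List;

/** Hypothetical stand-in for DistanceMeasurementException. */
class DistanceException extends Exception {
    DistanceException(final String message) {
        super(message);
    }
}

/** Hypothetical stand-in for DistanceMeasure#computeDistance. */
interface RowDistance {
    double computeDistance(double[] a, double[] b) throws DistanceException;
}

final class PairwiseDistanceSketch {

    private PairwiseDistanceSketch() { }

    /** Euclidean distance that rejects missing values (encoded here as NaN). */
    static final RowDistance EUCLIDEAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            if (Double.isNaN(a[i]) || Double.isNaN(b[i])) {
                throw new DistanceException("Missing value in column " + i);
            }
            final double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    };

    /** Fills a symmetric matrix; pairs that cannot be compared are stored as NaN. */
    static double[][] distanceMatrix(final List<double[]> rows, final RowDistance measure) {
        final int n = rows.size();
        final double[][] result = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double d;
                try {
                    d = measure.computeDistance(rows.get(i), rows.get(j));
                } catch (DistanceException e) {
                    d = Double.NaN; // unsupported value, for instance a missing cell
                }
                result[i][j] = d;
                result[j][i] = d;
            }
        }
        return result;
    }
}
```

Because the measure is only reached through the interface, the loop does not need to know which distance was configured upstream, which is the separation the framework aims for.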

Extension

General

The extension point (org.knime.distmatrix.distanceMeasures) and necessary interfaces are located in the org.knime.distmatrix bundle in the package org.knime.distance.

An uncategorized distance measure extension consists of implementations of the following three classes:

  • org.knime.distance.DistanceMeasureFactory<C>
    • Provides the id for the distance and defines the DataValue types for which the distance is defined.
    • The factory for DistanceMeasureConfig.
  • org.knime.distance.DistanceMeasureConfig<M>
    • Contains the distance configuration.
    • The factory for Distance Measures.
  • org.knime.distance.DistanceMeasure<T>
    • The actual distance function.
    • #getIndex() returns the column indexes to be used for the distance computation.
    • Also have a look at the helper methods of this class, such as #checkNotMissing(...).

In addition, an entry in plugin.xml is needed to extend the extension point; by convention, the id should be the fully qualified class name. More information about the Eclipse extension point mechanism can be found at [[Extension Point]].

    <extension point="org.knime.distmatrix.distanceMeasures">
        ...
        <DistanceMeasure
            factoryClass="org.knime.distance.measure.numerical.lnorm.LNormDistanceFactory"
            id="org.knime.distance.measure.numerical.lnorm.LNormDistanceFactory">
        </DistanceMeasure>
        ...
    </extension>
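The interplay of the three classes (a factory creates a config, and a config creates the measure) can be illustrated with a small, self-contained sketch. It deliberately does not use the KNIME API: SketchFactory, SketchConfig and SketchMeasure are simplified, hypothetical stand-ins for DistanceMeasureFactory, DistanceMeasureConfig and DistanceMeasure, columns are plain strings, and rows are double arrays. The real interfaces have more methods and different signatures.

```java
import java.util.List;

/** Simplified stand-in for DistanceMeasure: the actual distance function. */
interface SketchMeasure {
    double computeDistance(double[] a, double[] b);
}

/** Simplified stand-in for DistanceMeasureConfig: holds settings, creates measures. */
interface SketchConfig<M extends SketchMeasure> {
    M createDistanceMeasure(List<String> tableColumns);
}

/** Simplified stand-in for DistanceMeasureFactory: provides the id, creates configs. */
interface SketchFactory<C extends SketchConfig<?>> {
    String getId();

    C createConfig(List<String> selectedColumns);
}

final class ManhattanMeasure implements SketchMeasure {
    private final int[] m_indices;

    ManhattanMeasure(final int[] indices) {
        m_indices = indices;
    }

    @Override
    public double computeDistance(final double[] a, final double[] b) {
        double sum = 0;
        for (int i : m_indices) { // only the configured columns contribute
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}

final class ManhattanConfig implements SketchConfig<ManhattanMeasure> {
    private final List<String> m_columns;

    ManhattanConfig(final List<String> columns) {
        m_columns = columns;
    }

    @Override
    public ManhattanMeasure createDistanceMeasure(final List<String> tableColumns) {
        // resolve the configured column names against the concrete table layout
        final int[] indices = m_columns.stream().mapToInt(name -> {
            final int idx = tableColumns.indexOf(name);
            if (idx < 0) {
                throw new IllegalArgumentException("No such column: " + name);
            }
            return idx;
        }).toArray();
        return new ManhattanMeasure(indices);
    }
}

final class ManhattanFactory implements SketchFactory<ManhattanConfig> {
    @Override
    public String getId() {
        return ManhattanFactory.class.getName(); // id = class name, by convention
    }

    @Override
    public ManhattanConfig createConfig(final List<String> selectedColumns) {
        return new ManhattanConfig(selectedColumns);
    }
}
```

In this sketch the config only stores column names, and the column indexes are resolved when the measure is created for a concrete table, which mirrors the role of #getIndex() in the list above.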

Distance Categories

The distance category classes in the org.knime.distance.category package help to create KNIME nodes that group distance measures. They provide a general UI for the column selection and for the distance-specific configuration. Furthermore, they collect the documentation of the grouped distances and show it in the KNIME editor.

Define a category

A category defines its own subclass of DistanceCategoryFactory (itself a subclass of DistanceMeasureFactory), which serves as the parent class for the distance measures assigned to it. The constructor defines the common set of DataValue classes for all sub distances and how many columns they can handle: either a single column or multiple columns.

    public abstract class NumericalDistanceMeasureFactory<C extends DistanceCategoryConfig<?>>
        extends DistanceCategoryFactory<C> {

        /**
         * Constructor.
         */
        protected NumericalDistanceMeasureFactory() {
            super(ColumnQuantifier.MULTIPLE, DoubleValue.class);
        }
    }

A KNIME node for this category is created by extending the org.knime.distance.category.GenericDistanceNodeFactory<F> class instead of the usual NodeFactory. This class has to provide the ID of a default sub distance.

    public class NumericalDistanceNodeFactory
        extends GenericDistanceNodeFactory<NumericalDistanceMeasureFactory<?>> {

        /**
         * Constructor.
         */
        public NumericalDistanceNodeFactory() {
            super(LNormDistanceFactory.ID);
        }
    }

Like other KNIME nodes, it should provide a <FactoryName>.xml file containing the documentation in the same package as the factory. The documentation of the sub distances is inserted at the end of the fullDescription part.

The following list shows the default categories (the packages are also a good source of code examples):

  • org.knime.distance.measure.numerical.NumericalDistanceMeasureFactory<C>
  • org.knime.distance.measure.string.StringDistanceMeasureFactory<C>
  • org.knime.distance.measure.bitvector.BitVectorDistanceMeasureFactory<C>
  • org.knime.base.distance.bytevector.ByteVectorDistanceMeasureFactory<Config>

Adding a Distance to a Category

If a distance measure extends such a factory instead of DistanceMeasureFactory, it is automatically assigned to that category. Adding a new distance to the String Distances node, for example, is done by extending the corresponding factory class. The configuration class must then be based on the class listed below, but the implementation details are the same as before. The same holds for the DistanceMeasure class.

  • org.knime.distance.category.DistanceCategoryConfig<M>
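How a category node might find its sub distances purely from the class hierarchy can be sketched as follows. This is an illustration under stated assumptions, not the framework's actual lookup code: BaseFactory, NumericalCategoryFactory, StringCategoryFactory, the three concrete factories and CategoryLookupSketch are all hypothetical, and the use of Class#isInstance is an assumption about one plausible way to implement the assignment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for DistanceMeasureFactory. */
abstract class BaseFactory {
    abstract String getId();
}

/** Hypothetical category base classes, comparable to the numerical and string categories. */
abstract class NumericalCategoryFactory extends BaseFactory { }

abstract class StringCategoryFactory extends BaseFactory { }

final class EuclideanFactory extends NumericalCategoryFactory {
    @Override
    String getId() {
        return "euclidean";
    }
}

final class LevenshteinFactory extends StringCategoryFactory {
    @Override
    String getId() {
        return "levenshtein";
    }
}

/** Extends the base factory directly and therefore belongs to no category. */
final class UncategorizedFactory extends BaseFactory {
    @Override
    String getId() {
        return "custom";
    }
}

final class CategoryLookupSketch {

    private CategoryLookupSketch() { }

    /** Returns the ids of all registered factories that belong to the given category. */
    static List<String> idsInCategory(final List<BaseFactory> registered,
        final Class<? extends BaseFactory> category) {
        final List<String> ids = new ArrayList<>();
        for (BaseFactory factory : registered) {
            if (category.isInstance(factory)) { // membership is decided by the superclass
                ids.add(factory.getId());
            }
        }
        return ids;
    }
}
```

With such a scheme, contributing a new string distance requires no change to the category node: the new factory only has to extend the category's factory class and be registered at the extension point.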

The documentation of a sub distance works like node documentation: it is contained in a <FactoryName>.xml file in the same package as the distance measure factory. The XML file must declare the http://knime.org/distance/v1.0 namespace (the schema is located in the org.knime.distance package).

    <?xml version="1.0" encoding="utf-8"?>
    <distance name="Euclidean Distance" xmlns="http://knime.org/distance/v1.0">
        <shortDescription
            href="http://en.wikipedia.org/wiki/Euclidean_distance"
            intro="Euclidean distance is the square root of the sum of the squares of the differences of each coordinate of the byte vectors." />
        <fullDescription>
            <p>
                The Euclidean distance is also known as L<sub>2</sub> distance or l<sub>2</sub> norm.
                Euclidean distance is the "ordinary" distance between two points that one would measure with a ruler.
            </p>
        </fullDescription>
    </distance>

@Configuration and @Property annotation

Since the configuration classes are mainly data containers with no actual logic, a set of annotations has been introduced to make saving and loading the internal state transparent. Inherited fields are not considered during processing.

  • org.knime.distance.util.propertyresolver.Configuration
    • Marks a class as loadable and saveable through the annotation processor.
  • org.knime.distance.util.propertyresolver.Property
    • Marks a field to be serialized automatically.
    • value: the property key.
    • required: determines whether the property must exist.
    • typeHint: determines the type of the collection elements (only for collections, not for arrays).
    • immutable: determines whether the collection that is set should be immutable (only for collections, not for arrays).
    • Currently supported types:
      • double, int, long, boolean, String (including the corresponding array types)
      • Enum, EnumSet
      • Set

An example

    @Configuration
    public class PNormDistanceConfig extends DistanceCategoryConfig<PNorm> {
        @Property("p-norm order")
        private double m_order = 1d;
        …
    }

The double field m_order is automatically serialized when necessary. Developers may replace the annotation processing by overriding the DistanceMeasureConfig#loadConfig, DistanceMeasureConfig#saveConfig, DistanceCategoryConfig#loadConfig and DistanceCategoryConfig#saveConfig methods.
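To give an idea of what such annotation processing involves, here is a minimal, self-contained sketch based on Java reflection. It is not the KNIME implementation: SketchProperty is a hypothetical, reduced counterpart of @Property (its default for required is an assumption), a plain Map stands in for the KNIME settings object, and PropertySketch and PNormSketchConfig are made up for illustration. The use of getDeclaredFields() shows one way in which inherited fields end up being ignored.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified counterpart of the @Property annotation. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface SketchProperty {
    String value();

    boolean required() default true;
}

final class PropertySketch {

    private PropertySketch() { }

    /** Copies every annotated field declared directly on the object's class into a map. */
    static Map<String, Object> save(final Object config) {
        final Map<String, Object> settings = new HashMap<>();
        // getDeclaredFields() does not return inherited fields
        for (Field field : config.getClass().getDeclaredFields()) {
            final SketchProperty property = field.getAnnotation(SketchProperty.class);
            if (property != null) {
                try {
                    field.setAccessible(true);
                    settings.put(property.value(), field.get(config));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return settings;
    }

    /** Writes stored values back; a missing required key is reported as an error. */
    static void load(final Object config, final Map<String, Object> settings) {
        for (Field field : config.getClass().getDeclaredFields()) {
            final SketchProperty property = field.getAnnotation(SketchProperty.class);
            if (property == null) {
                continue;
            }
            if (settings.containsKey(property.value())) {
                try {
                    field.setAccessible(true);
                    field.set(config, settings.get(property.value()));
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException(e);
                }
            } else if (property.required()) {
                throw new IllegalArgumentException("Missing property: " + property.value());
            }
        }
    }
}

/** Example configuration with one persisted and one non-persisted field. */
final class PNormSketchConfig {
    @SketchProperty("p-norm order")
    double m_order = 1d;

    double m_cache = 42d; // not annotated, therefore ignored by save and load
}
```

Calling PropertySketch.save on a PNormSketchConfig yields a map with the single key "p-norm order", and PropertySketch.load restores the value into a fresh instance. Overriding the load and save methods, as described above, corresponds to bypassing such a generic routine with hand-written code.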