For Developers: Integration of Custom Tagger

This tutorial explains how to create and integrate a custom tagger node that tags words and terms in texts, just like the tagger nodes provided by the KNIME Textprocessing plugin. It is organized as follows. The first section describes how to set up the KNIME SDK, install all required features, and create a KNIME project. The second section describes how to implement a tagger node, which classes to extend, and how to use previously integrated custom tag sets. The last section shows how to use the tagger in KNIME workflows.

It is assumed that the basic KNIME concepts of node development are known and understood. If you are new to KNIME node development, it is strongly recommended to read the KNIME developer guide (http://tech.knime.org/developer-guide) first. The sections about the KNIME NodeFactory, NodeModel, and NodeDialog in particular will become useful later on; the NodeView section is not strictly necessary. Furthermore, it is assumed that the basic concepts of KNIME Textprocessing TagSets are known. If you are new to KNIME Textprocessing TagSet or Tagger development, it is recommended to read the tutorial about the integration of custom TagSets beforehand (http://tech.knime.org/for-developers-integration-of-custom-tag-sets).

KNIME Setup

This section describes how to set up the KNIME SDK, install all required features, and create a KNIME project to integrate a custom tagger node. For this tutorial KNIME SDK version 2.8 was used.

First, download and install the KNIME SDK from the download section of the KNIME website. The SDK version is required because tagger nodes, like all other KNIME nodes, have to be integrated via a KNIME extension point. This means that Java classes have to be created that implement and extend certain interfaces and abstract classes in order to integrate custom Java code into KNIME. Once the KNIME SDK has been downloaded and installed, start it, if necessary with customized arguments in the knime.ini.

Second, install the KNIME Textprocessing feature. KNIME features or plug-ins can be installed via the Eclipse update mechanism, which is described in detail here: http://www.knime.org/downloads/update. The Textprocessing feature can be found under the KNIME Labs Extensions section.

KNIME Project

Once the Textprocessing feature has been installed and the KNIME SDK has been restarted, a new KNIME project has to be created. To do so, right-click in the Eclipse Package Explorer view, click New->Other and select Create a new KNIME Node-Extension. It is recommended to use the Node-Extension wizard for creating a new TagSet extension as well. Click Next to get to the Create new KNIME Node-Extension view. Specify the project name, a name for the tagger node to create, a Java package name, a node vendor name, and a description. The option Include sample code in generated classes can be unchecked if the sample code is not wanted. The Node-Extension wizard will create the classes for the new node. Finally, click Finish to generate the project. The figure below shows a filled-in Node-Extension wizard view. In this example a sentiment tagger is created that assigns tags of the sentiment tag set described in the “Integration of Custom Tag Sets” tutorial (http://tech.knime.org/for-developers-integration-of-custom-tag-sets); thus the project is named SentimentTagging.

 

In this example the new project contains a Java package org.knime.example as well as the classes SentimentTaggerNodeDialog, SentimentTaggerNodeFactory, SentimentTaggerNodeModel, SentimentTaggerNodePlugin, and SentimentTaggerNodeView. Furthermore there exists a file SentimentTaggerNodeFactory.xml containing the node description, which is shown in the Node Description view of KNIME. The figure below shows the new project in the eclipse Package Explorer view.

 

To check whether the project was created properly and the new node has been registered, run KNIME and see whether the new node appears in the Node Repository view. To run KNIME, click Run->Run Configurations and create a new run configuration by double-clicking Eclipse Application. Select Run a product and choose org.knime.product.KNIME_PRODUCT from the drop-down menu. In the Arguments tab, Java VM parameters such as -Xmx and -Xms can be specified. The figure below shows the run configuration of the sentiment example.

 

Finally, click Run to start KNIME. If everything is working as it should, the new node shows up in the Node Repository view, as shown in the figure below.

 

If KNIME is running and the node shows up, the project was set up properly and the node has been registered and can be used. Now everything is prepared to create and integrate a custom tagger node. Close KNIME to proceed.

TagSet Preliminaries

It is assumed that the example Sentiment TagSet has been created and registered as described in the tutorial “Integration of Custom TagSets” (http://tech.knime.org/for-developers-integration-of-custom-tag-sets). Tags of this TagSet will be assigned by the Sentiment Tagger node, which is described later on. The example Sentiment TagSet consists of the two Java classes SentimentTag and SentimentTagSet, both contained in the package tagset, as can be seen in the figure below.

 

Integration of a Tagger Node

In this section it is described how to implement and integrate a KNIME Textprocessing tagger node, assuming that the node stub has already been created and the example Sentiment TagSet has been integrated, as described in the section above.

Tagger Factory

The Sentiment Tagger node consists of a NodeModel, in which the tagging of words is applied, and a NodeDialog, which allows the user to specify whether the tagged entities should be set unmodifiable. Terms that have been set unmodifiable by a tagger node will not be filtered or changed in any way by any preprocessing node. This is useful when, for example, named entities are recognized and tagged and these names should not be stemmed or filtered by any subsequent node. A NodeView, on the other hand, is not needed. Thus the class SentimentTaggerNodeView can be deleted and the class SentimentTaggerNodeFactory changed so that its method getNrNodeViews() returns 0 and createNodeView(…) returns null.

/**
 * {@inheritDoc}
 */
@Override
public int getNrNodeViews() {
    return 0;
}
 
/**
 * {@inheritDoc}
 */
@Override
public NodeView<SentimentTaggerNodeModel> createNodeView(final int viewIndex,
    final SentimentTaggerNodeModel nodeModel) {
    return null;
}

Tagger Settings

First it is described how to implement the dialog of the node, to save the specified settings and make use of them in the NodeModel.

The dialog of the Sentiment Tagger node consists of only one checkbox, with which the user can specify whether the tagged terms have to be set unmodifiable. For this purpose the dialog component DialogComponentBoolean is added in the constructor of the dialog class. The underlying SettingsModel is a SettingsModelBoolean. It is assumed that the concepts of dialog components and SettingsModels for the creation of KNIME node dialogs are known and understood. A detailed description of these concepts can be found here: http://tech.knime.org/default-dialog-components. The next code listing shows the class SentimentTaggerNodeDialog with a static method to create the SettingsModel and a dialog component added in the constructor.

import org.knime.core.node.defaultnodesettings.DefaultNodeSettingsPane;
import org.knime.core.node.defaultnodesettings.DialogComponentBoolean;
import org.knime.core.node.defaultnodesettings.SettingsModelBoolean;

public class SentimentTaggerNodeDialog extends DefaultNodeSettingsPane {
    // configuration key and default value of the unmodifiable setting
    static final String CFGKEY_UNMODIFIABLE = "Unmodifiable";
    static final boolean DEFAULT_UNMODIFIABLE = true;

    /**
     * Creates and returns a
     * {@link org.knime.core.node.defaultnodesettings.SettingsModelBoolean}
     * containing the user settings whether terms representing named entities
     * have to be set unmodifiable or not.
     *
     * @return A {@link SettingsModelBoolean} containing the terms
     * unmodifiable flag.
     */
    static SettingsModelBoolean createSetUnmodifiableModel() {
        return new SettingsModelBoolean(
            CFGKEY_UNMODIFIABLE, DEFAULT_UNMODIFIABLE);
    }

    /**
     * New pane for configuring the SentimentTagger node.
     */
    protected SentimentTaggerNodeDialog() {
        addDialogComponent(
            new DialogComponentBoolean(createSetUnmodifiableModel(),
                "Set named entities unmodifiable"));
    }
}

To make use of the specified settings in the NodeModel, a corresponding SettingsModel instance has to be created in the model and load, save and validate methods have to be called on it.

public class SentimentTaggerNodeModel extends NodeModel {

    private SettingsModelBoolean m_unmodifiableModel =
        SentimentTaggerNodeDialog.createSetUnmodifiableModel();
….

}

To save and load the settings the appropriate methods have to be called on the SettingsModel.

/**
 * {@inheritDoc}
 */
@Override
protected void saveSettingsTo(final NodeSettingsWO settings) {
    m_unmodifiableModel.saveSettingsTo(settings);
}
 
/**
 * {@inheritDoc}
 */
@Override
protected void loadValidatedSettingsFrom(final NodeSettingsRO settings)
        throws InvalidSettingsException {
    m_unmodifiableModel.loadSettingsFrom(settings);
}
 
/**
 * {@inheritDoc}
 */
@Override
protected void validateSettings(final NodeSettingsRO settings)
        throws InvalidSettingsException {
    m_unmodifiableModel.validateSettings(settings);
}
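
The interplay of save, validate, and load can be illustrated outside of KNIME. The following is a minimal sketch that uses a plain java.util.Map as a stand-in for the node settings and a hypothetical class BooleanSettingSketch as a stand-in for SettingsModelBoolean; none of these names are part of the KNIME API. It only shows the idea that dialog and model share the same key and default value, so that a value saved on one side can be validated and loaded on the other.

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java stand-in for a boolean settings model; NOT the KNIME class. */
final class BooleanSettingSketch {
    private final String key;
    private boolean value;

    BooleanSettingSketch(String key, boolean defaultValue) {
        this.key = key;
        this.value = defaultValue;
    }

    boolean getValue() {
        return value;
    }

    void setValue(boolean newValue) {
        value = newValue;
    }

    /** Writes the current value under the configuration key. */
    void saveTo(Map<String, Object> settings) {
        settings.put(key, value);
    }

    /** Checks that the key holds a boolean, without changing the state. */
    void validate(Map<String, Object> settings) {
        if (!(settings.get(key) instanceof Boolean)) {
            throw new IllegalArgumentException("Missing or invalid setting: " + key);
        }
    }

    /** Validates first, then takes over the stored value. */
    void loadFrom(Map<String, Object> settings) {
        validate(settings);
        value = (Boolean) settings.get(key);
    }

    public static void main(String[] args) {
        // dialog side: the user unchecks the box and the value is saved
        BooleanSettingSketch dialogSide = new BooleanSettingSketch("Unmodifiable", true);
        dialogSide.setValue(false);
        Map<String, Object> stored = new HashMap<>();
        dialogSide.saveTo(stored);

        // model side: same key and default, value is loaded from the settings
        BooleanSettingSketch modelSide = new BooleanSettingSketch("Unmodifiable", true);
        modelSide.loadFrom(stored);
        System.out.println("Model sees unmodifiable = " + modelSide.getValue());
    }
}
```

Creating both instances from the same key and default mirrors why the tutorial uses one static factory method in the dialog class and calls it from the model as well.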

Since no view is implemented, no view-internal settings or data have to be saved or loaded. The methods loadInternals(…) and saveInternals(…) still have to be overridden, but their bodies are left empty.

/**
 * {@inheritDoc}
 */
@Override
protected void loadInternals(final File internDir, final ExecutionMonitor exec) throws IOException,
        CanceledExecutionException {
    // Nothing to do ...
}
 
/**
 * {@inheritDoc}
 */
@Override
protected void saveInternals(final File internDir, final ExecutionMonitor exec) throws IOException,
        CanceledExecutionException {
    // Nothing to do ...
}

Tagger Model

In this section it is described how the model of the Sentiment Tagger is implemented. In the section above it was shown how the settings, specified in the node dialog are saved and loaded in the NodeModel. Now the methods configure(…) and execute(…), which need to be implemented in the NodeModel are explained.

Configuration

First it has to be specified how many input ports are required by the node and how many output ports are created. This can be done in the constructor of the NodeModel class. In this example one input data table is required, containing the input documents to tag and one output data table is created, containing the tagged documents. Therefore the super constructor needs to be called with parameters 1 and 1.

Before a node can be executed it has to be configured. This means that it needs to be checked whether the input data table of the node contains all required columns. To validate the types of the columns of the input data table, the DataTableSpec provided as a parameter of the configure(…) method has to be used. The KNIME Textprocessing plugin provides convenience classes to validate the structure of input data tables and to create output data tables as well as their specifications. In this example the classes DataTableSpecVerifier and DocumentDataTableBuilder are used for this purpose. The Sentiment Tagger node requires one input data table containing one document column (documents to tag) and creates an output data table with one document column (tagged documents).

The following code listing shows how to implement the constructor of the SentimentTaggerNodeModel class, which specifies one input and one output port, as well as how to implement the configure(…) method in order to check whether the input data table contains at least one document column and to create an appropriate output data table spec.

private int m_docColIndex = -1;
 
private DocumentDataTableBuilder m_dtBuilder;
 
/**
 * Creates new instance of {@link SentimentTaggerNodeModel} which adds
 * sentiment tags to terms of documents.
 */
public SentimentTaggerNodeModel() {
       super(1, 1);
       m_dtBuilder = new DocumentDataTableBuilder();
}
 
/**
 * {@inheritDoc}
 */
@Override
protected DataTableSpec[] configure(final DataTableSpec[] inSpecs)
             throws InvalidSettingsException {
       DataTableSpecVerifier verifier = new DataTableSpecVerifier(inSpecs[0]);
       verifier.verifyDocumentCell(true);
       m_docColIndex = verifier.getDocumentCellIndex();
       return new DataTableSpec[] { m_dtBuilder.createDataTableSpec() };
}

A DataTableSpecVerifier, created in the first line of the configure(…) method, is used to verify the specification of the input data table. Its method verifyDocumentCell(boolean) checks whether the input data table contains at least one document column. If true is passed as the parameter value, an exception is thrown if no document column is contained; otherwise no exception is thrown. Furthermore, the column index of the first document column of the input table is saved to the variable m_docColIndex. The index is needed later on when the documents are tagged. Finally, the DocumentDataTableBuilder, created in the constructor, is used to create the specification of the output data table via the method call createDataTableSpec().
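
The kind of check performed during configuration can be sketched in plain Java. The following is an illustration only: the class SpecCheckSketch, the method firstColumnOfType, and the representation of a table spec as a list of type names are assumptions made for this example and are not part of KNIME. It shows the two behaviors described above, namely finding the index of the first matching column and optionally throwing an exception if no such column exists.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the spec check done in configure(...); NOT KNIME API. */
final class SpecCheckSketch {

    /**
     * Returns the index of the first column of the required type, or -1 if
     * there is none. If throwIfMissing is true and no such column exists,
     * an exception is thrown instead.
     */
    static int firstColumnOfType(List<String> columnTypes, String requiredType,
            boolean throwIfMissing) {
        int index = columnTypes.indexOf(requiredType);
        if (index < 0 && throwIfMissing) {
            throw new IllegalStateException(
                "Input table contains no column of type " + requiredType);
        }
        return index;
    }

    public static void main(String[] args) {
        // hypothetical input spec: a string column followed by two document columns
        List<String> spec = Arrays.asList("String", "Document", "Document");
        System.out.println("First document column: "
            + firstColumnOfType(spec, "Document", true));
    }
}
```

Failing early in the configuration step means that a wrongly connected node is reported before any data is processed.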

Tagging

Before the execute(…) method of the NodeModel can be implemented, the tagger class that performs the actual tagging has to be implemented. This class, here named SentimentTagger, is then used in the execute(…) method to tag the documents. The tagger class SentimentTagger extends the abstract class AbstractDocumentTagger. In the SentimentTagger class words are tagged as named entities, meaning that words of a sentence that are recognized, e.g. by an underlying model, are marked in a certain way. The abstract class AbstractDocumentTagger transforms the marked words of the underlying document into terms and, if the unmodifiable flag is set, sets them unmodifiable. The unmodifiable flag has to be passed to the constructor of the AbstractDocumentTagger. The following code listing shows the constructors of the SentimentTagger class.

/**
 * Creates new instance of {@link SentimentTagger} with unmodifiable flag to set.
 * @param setUnmodifiable The unmodifiable flag to set. If {@code true} tagged  
 * terms will not be modified by preprocessing nodes.
 */
public SentimentTagger(boolean setUnmodifiable) {
       super(setUnmodifiable);
}
 
/**
 * Creates new instance of {@link SentimentTagger} with unmodifiable and case sensitive flag to set.
 * @param setUnmodifiable The unmodifiable flag to set. If {@code true} tagged terms will not be modified by preprocessing nodes.
 * @param caseSensitive The case sensitive flag to set. If {@code true} tagging is applied case sensitive, otherwise not.
 */
public SentimentTagger(boolean setUnmodifiable, boolean caseSensitive) {
       super(setUnmodifiable, caseSensitive);
}

In addition to the unmodifiable flag a case sensitive flag can be set as well, specifying whether the transformation of marked words into terms has to be done case sensitive or not.

The SentimentTagger class has to provide the related tags for the marked words, which is done in the getTags(…) method. For the given string representation of a tag, the corresponding Tag instance must be returned. In this example the SentimentTagger class assigns tags of the custom SentimentTagSet, which is set as private member.

public class SentimentTagger extends AbstractDocumentTagger {
 
    private SentimentTagSet m_tagSet = new SentimentTagSet();
...
 
    /* (non-Javadoc)
     * @see org.knime.ext.textprocessing.nodes.tagging.AbstractDocumentTagger#getTags(java.lang.String)
     */
    @Override
    protected List<Tag> getTags(String tag) {
        // prepare list of tags related to given tag-string
        List<Tag> tags = new ArrayList<Tag>();
        tags.add(m_tagSet.buildTag(tag));
        return tags;
    }
 
...
}

In the method List<TaggedEntity> tagEntities(Sentence) the actual tagging takes place. The only parameter is a sentence, containing the terms and words of the sentence to process. The method returns a list of TaggedEntity instances, representing the marked words, which are then transformed to terms in the underlying document by the AbstractDocumentTagger class.

In the following code listing words are marked randomly as tagged entities, and the sentiment tags are assigned randomly as well. In a real tagger, the recognizer model would be called at this point to perform the actual tagging logic.

/* (non-Javadoc)
 * @see org.knime.ext.textprocessing.nodes.tagging.AbstractDocumentTagger#tagEntities(org.knime.ext.textprocessing.data.Sentence)
 */
@Override
protected List<TaggedEntity> tagEntities(Sentence sentence) {
        // prepare list of terms that have been tagged by the tagger model (tagged entities)
        List<TaggedEntity> taggedEntities = new ArrayList<TaggedEntity>();
 
        // in this example random tagging is applied
        Random r = new Random();
 
        // over all terms of sentence
        for (Term t : sentence.getTerms()) {
            // over all words of term
            for (Word w : t.getWords()) {
                double rD = r.nextDouble();
                // around 40% of the words are tagged randomly
                if (rD >= 0.6) {
                    // create tag value based on tagger model randomly
                    String tagValue = SentimentTag.NEUTRAL.toString();
                    if (rD <= 0.7) {
                           tagValue = SentimentTag.POSITIVE.toString();
                    } else if (rD <= 0.8) {
                           tagValue = SentimentTag.NEGATIVE.toString();
                    }
 
                    // create tagged entity of (word or words).
                    taggedEntities.add(new TaggedEntity(w.getWord(), tagValue));
                }
            }
        }
        return taggedEntities;
}
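
To give an impression of what non-random tagging logic could look like, the following minimal sketch replaces the random decision with a dictionary lookup. It is plain Java without KNIME dependencies: the class LexiconTaggingSketch, the lexicon entries, and the representation of a tagged entity as a pair of strings are made up for illustration, and the tag values are assumed to match the string representations of the SentimentTag constants.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Plain-Java sketch of dictionary-based tagging; the lexicon entries are made up. */
final class LexiconTaggingSketch {

    /** Hypothetical lexicon: lower-cased word mapped to a tag value string. */
    static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("good", "POSITIVE");
        LEXICON.put("great", "POSITIVE");
        LEXICON.put("bad", "NEGATIVE");
        LEXICON.put("okay", "NEUTRAL");
    }

    /** Returns {word, tagValue} pairs for every word found in the lexicon. */
    static List<String[]> tagWords(List<String> words) {
        List<String[]> tagged = new ArrayList<>();
        for (String word : words) {
            String tagValue = LEXICON.get(word.toLowerCase(Locale.ROOT));
            if (tagValue != null) {
                // inside the tagger node this would correspond to:
                // taggedEntities.add(new TaggedEntity(word, tagValue));
                tagged.add(new String[] {word, tagValue});
            }
        }
        return tagged;
    }

    public static void main(String[] args) {
        for (String[] pair : tagWords(Arrays.asList("What", "a", "Great", "day"))) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

The original spelling of the word is kept in the result, since the marked words are later matched against the text of the document.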

In this example only single words are marked as tagged entities. However, it is also possible to mark multiple words, such as “text mining”. The first parameter of the constructor of the TaggedEntity class is a string which can consist of concatenated multiple words. The words need to be separated by a single whitespace character, e.g.:

new TaggedEntity("text mining", tagValue);
 
new TaggedEntity(w1.getWord() + " " + w2.getWord(), tagValue);

The second parameter is the string representation of the tag to assign. The method getTags(…) provides the AbstractDocumentTagger with the related Tag instance later on.
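
How a multi-word entity string relates to the words of a sentence, and what the case sensitive flag means for the comparison, can be sketched as follows. This is a plain-Java illustration under the stated assumptions and not the actual implementation in AbstractDocumentTagger; the class MultiWordMatchSketch and the method indexOfEntity are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of locating a whitespace-separated entity in a word list. */
final class MultiWordMatchSketch {

    /**
     * Returns the index of the word at which the first occurrence of the
     * entity starts, or -1 if the entity does not occur in the word list.
     */
    static int indexOfEntity(List<String> words, String entity, boolean caseSensitive) {
        // the entity consists of one or more words separated by a single space
        String[] parts = entity.split(" ");
        for (int start = 0; start + parts.length <= words.size(); start++) {
            boolean match = true;
            for (int i = 0; i < parts.length && match; i++) {
                String word = words.get(start + i);
                match = caseSensitive
                    ? word.equals(parts[i])
                    : word.equalsIgnoreCase(parts[i]);
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Text", "mining", "is", "fun");
        System.out.println(indexOfEntity(words, "text mining", false));
        System.out.println(indexOfEntity(words, "text mining", true));
    }
}
```

With the sentence above, the case-insensitive comparison finds the entity at the first word, whereas the case-sensitive comparison does not find it because of the capitalized "Text".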

Furthermore, the document can be preprocessed before the tagging itself takes place. This is done in the preprocess(…) method, which is called before tagEntities(…) is called. In this example no preprocessing is necessary, thus the method is left empty.


Execution

Now that the tagger class is implemented, the execute(…) method of the SentimentTaggerNodeModel can be implemented using an instance of SentimentTagger. The implementation is straightforward. First a new instance of SentimentTagger is created. Then the rows of the input data table are iterated, and tagging is applied to the document in the document column of each row. Each tagged document is added to the new output data table, which is returned at the end.

/**
 * {@inheritDoc}
 */
@Override
protected BufferedDataTable[] execute(final BufferedDataTable[] inData,
                    final ExecutionContext exec) throws Exception {
       // create tagger instance with unmodifiable setting
       DocumentTagger tagger = new SentimentTagger(m_unmodifiableModel.getBooleanValue()); 
 
       // prepare data table builder to be able to add tagged documents
       m_dtBuilder.openDataTable(exec); 
 
       // prepare row iterator and row counts
       RowIterator it = inData[0].iterator();
       int rowCount = inData[0].getRowCount();
       int currDoc = 1;
       while (it.hasNext()) {
              // check if process has been canceled
              exec.checkCanceled();
                    
             // set progress
             double progress = (double) currDoc / (double) rowCount;
             exec.setProgress(progress, "Tagging document " + currDoc + " of " + rowCount);
             currDoc++;
 
             // get current row
             DataRow row = it.next();
             // get document cell from row at index of document column
             DocumentValue docVal = (DocumentValue) row.getCell(m_docColIndex);
                    
             // tag document and add tagged document to data table builder
             m_dtBuilder.addDocument(tagger.tag(docVal.getDocument()));
       }
       // close data table builder and create output data table
       return new BufferedDataTable[] { m_dtBuilder.getAndCloseDataTable() };
}
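
The iteration and progress pattern used in execute(…) can also be tried out without KNIME. The following minimal sketch uses plain Java stand-ins: documents are represented as strings, the tagger as a UnaryOperator, and the progress values are collected in a list instead of being passed to an ExecutionContext. The class name ExecuteLoopSketch and the method processAll are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.UnaryOperator;

/** Plain-Java sketch of the row loop in execute(...); NOT KNIME API. */
final class ExecuteLoopSketch {

    /**
     * Applies the tagger to every document and records the progress
     * fraction (current document / total documents) for each step.
     */
    static List<String> processAll(List<String> docs, UnaryOperator<String> tagger,
            List<Double> progressLog) {
        List<String> tagged = new ArrayList<>();
        int rowCount = docs.size();
        int currDoc = 1;
        for (String doc : docs) {
            // progress grows from 1/rowCount up to exactly 1.0 for the last row
            progressLog.add((double) currDoc / (double) rowCount);
            currDoc++;
            tagged.add(tagger.apply(doc));
        }
        return tagged;
    }

    public static void main(String[] args) {
        List<Double> progress = new ArrayList<>();
        List<String> result = processAll(
            Arrays.asList("doc a", "doc b", "doc c", "doc d"),
            doc -> doc + " [tagged]", progress);
        System.out.println(result);
        System.out.println(progress);
    }
}
```

Starting the counter at 1 rather than 0 is what makes the reported progress reach 100% when the last document is processed.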

Description

Finally, describe the functionality of the node, its dialog option, and its input and output ports in the SentimentTaggerNodeFactory.xml file.

Usage

This section briefly describes how to test whether a custom tagger node has been properly integrated and can be used.

To do so, run KNIME from within the KNIME SDK. Create a new workflow and apply the Sentiment Tagger node to tag a list of documents. Then use the BoW node to create a bag of words. The assigned tags can be seen in the term column.

The following picture shows a part of a workflow containing the SentimentTagger and BoW nodes. The dialog of the SentimentTagger node is opened, as well as the output data table of the BoW node. It can be seen that some terms, such as “what”, “a”, or “German” have tags assigned, based on a random tagging.