Authors: Vincenzo Tursi, Kathrin Melcher, Rosaria Silipo
Remember Emil the Teacher Bot? Well, today we want to talk about how Emil’s brain was created! The goal in creating Emil’s brain was to enable him to associate the most suitable resources on the KNIME website with keywords from the questions he is asked.
Before we continue, let’s recap what we’ve talked about so far in this Teacher Bot series:
- The first blog post “Emil the Teacher Bot” describes how the workflow behind Emil combines a web browser based GUI with text processing, keyword extraction, and machine learning algorithms.
- The second post “Keyword Extraction for Understanding” discussed the automatic keyword extraction algorithms available in KNIME Analytics Platform and our criterion for choosing among them.
- The third post in the series “An Ontology for Emil” stressed the importance of defining a class ontology before starting a classification project.
As we’ve shown in these previous blog posts, our source was the questions posted on the KNIME Forum between 2013 and 2017. Questions were imported and stored as unanswered, as only a few answers contained links to educational resources, and only some of those links referred to up-to-date educational material. At the same time, we adopted a class ontology with 20 classes.
Emil’s brain’s goal then became twofold:
- Associate the right class from the ontology to the keywords summarizing each question.
- Within each predicted class, explore the educational resources on the KNIME site and extract the four most relevant ones.
Today, let’s concentrate on goal #1: associating the right class from the ontology with the input question, i.e. the keywords summarizing the question.
The Classification Problem
Though seemingly a simple problem to solve, the difficulty lies in the dataset being unlabeled. Without class labels, we cannot implement any kind of supervised learning. At this point, we could either have opted for an unsupervised algorithm, hoping to match the final clusters with the ontology classes, or we could have enforced supervised training by using the active learning technique.
As unsupervised learning requires defining an appropriate number of clusters and, by definition, does not even try to match the desired classes, we leaned towards the second option: supervised learning plus the active learning technique.
A third option of course could have been to manually label the dataset, but that is expensive and time consuming. Given the budget constraints, this was not a viable option for this project.
The Active Learning Cycle
This cycle starts with a labeled training set. At first, correct labeling is not important. The main requirement is that the dataset is labeled at all.
The Initially Labeled Dataset
We calculated the N-gram Tversky distance between the keywords extracted from each question and the keywords extracted from each tutorial page on the KNIME site. The class of the closest tutorial page was then assigned to the question. This roughly labeled dataset was our starting point. The workflow “02_AL_First_Try_Assign_Classes_via_Distance” is available on the EXAMPLES Server at 50_Applications/33_Emil_the_TeacherBot/* . Based on this “mislabeled” dataset, we trained the supervised model of choice: a Random Forest with 100 trees.
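In Python terms, this initial labeling step could be sketched roughly as follows. The helper names, the alpha = beta = 0.5 weighting, and the page dictionary structure are our assumptions for illustration, not the exact workflow implementation:

```python
def tversky_distance(a, b, alpha=0.5, beta=0.5):
    """1 minus the Tversky index between two sets of n-grams.
    With alpha = beta = 0.5 this reduces to the Dice coefficient."""
    a, b = set(a), set(b)
    inter = len(a & b)
    denom = inter + alpha * len(a - b) + beta * len(b - a)
    return 1.0 - (inter / denom if denom else 0.0)

def initial_label(question_ngrams, tutorial_pages):
    """Assign the class of the closest tutorial page to a question.
    tutorial_pages: list of {"class": ..., "ngrams": ...} dicts (hypothetical)."""
    closest = min(
        tutorial_pages,
        key=lambda page: tversky_distance(question_ngrams, page["ngrams"]),
    )
    return closest["class"]
```

Other choices of alpha and beta would penalize the two set differences asymmetrically, e.g. weighting question-only n-grams more heavily than page-only ones.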
Detecting the Frontier Points
After training the model, a criterion is needed to identify those points at the frontier of the class groups.
We assumed that the most uncertain predictions would refer to the frontier points. Uncertain predictions are those where the highest class probability is not very distant from the second highest. Quantifying: if the difference between the two highest class probabilities produced by the random forest for a given question is lower than 0.2, the predicted class is considered “uncertain”. At the end of each training procedure, we identified the top 10% of questions classified as most uncertain. These are the frontier questions across classes. The “03_AL_Training_Subset_Uncertain_Classes” workflow is available on the EXAMPLES Server at 50_Applications/33_Emil_the_TeacherBot/02_ActiveLearning* .
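As a rough sketch of this criterion (function and parameter names are ours, not part of the KNIME workflow), the selection amounts to:

```python
import numpy as np

def uncertain_indices(proba, threshold=0.2, fraction=0.10):
    """proba: (n_samples, n_classes) array of class probabilities from the
    classifier. A prediction counts as 'uncertain' when the gap between its
    two highest class probabilities is below `threshold`; return the indices
    of the most uncertain `fraction` of the dataset, smallest gaps first."""
    top2 = np.sort(proba, axis=1)[:, -2:]   # two highest probabilities per row
    gap = top2[:, 1] - top2[:, 0]
    uncertain = np.where(gap < threshold)[0]
    k = max(1, int(len(proba) * fraction))
    return uncertain[np.argsort(gap[uncertain])][:k]
```

These indices would then point at the questions to hand over for manual relabeling.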
Relabeling the Frontier Points
These top 10% most uncertain questions were then manually relabeled by one of our teammates, represented by “The Thinker” in Figure 1. The workflows “04_AL_Re-label_Uncertain_Classes” and “05_AL_Labeling” handled this relabeling. They are also available on the KNIME EXAMPLES Server at 50_Applications/33_Emil_the_TeacherBot/02_ActiveLearning* .
Extending Labels to Neighbor Points
Labeling is then extended to the closest neighbor points, and from there to the neighbors of the neighbors, using a k Nearest Neighbor (k-NN) algorithm. The workflow that takes care of this is “06_AL_Extend_Expert_Classes_with_kNN”, available on the EXAMPLES Server at 50_Applications/33_Emil_the_TeacherBot/02_ActiveLearning* .
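A minimal k-NN label propagation could look like the sketch below, assuming numeric feature vectors for the questions (e.g. document vectors) and a simple majority vote; the function and variable names are hypothetical:

```python
import numpy as np
from collections import Counter

def extend_labels(labeled_X, labeled_y, unlabeled_X, k=3):
    """Propagate expert labels to unlabeled points via majority vote
    among the k nearest labeled neighbors (Euclidean distance)."""
    propagated = []
    for x in unlabeled_X:
        distances = np.linalg.norm(labeled_X - x, axis=1)
        nearest = np.argsort(distances)[:k]
        votes = Counter(labeled_y[i] for i in nearest)
        propagated.append(votes.most_common(1)[0][0])
    return propagated
```

Applied repeatedly, this spreads the expert labels from the relabeled frontier questions outwards to their neighbors, and then to the neighbors of the neighbors.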
Taking this freshly relabeled training set, we retrain our supervised model. Again, the top 10% of questions with the most uncertain predictions are extracted and manually relabeled by the thinking teammate. This cycle is repeated.
The active learning cycle is shown in Figure 1.
Figure 1. The Active Learning cycle used to build Emil the Teacher Bot’s brain.
The random forest model was trained with a Random Forest Learner node on the entire dataset. A Math Formula node, within a loop, computed the difference between the two highest class probabilities for each question. Finally, the subset of the top 10% of questions classified as most uncertain (i.e. with the smallest difference between the top class probabilities) was extracted. This subset was used as input for the manual relabeling.
GUI for Manual Relabeling
Two web GUIs were employed to manually re-label the extracted questions on the KNIME WebPortal:
- one web page to confirm or reject the most likely class as the question label;
- in case of rejection, a web page to manually insert the new required labels.
The first web page, called “Category”, which prompts you to confirm or reject the top predicted class as the question label, is shown in Figure 2a; the corresponding sub-workflow, in the Choose Answer metanode, is in Figure 2b.
Figure 2a. Web page generated by the Choose Answer wrapped metanode. This page allows you to confirm or reject the most likely class as the question label. Selecting the Something Else option defers the class assignment to the next step. The I’m fed up button stops this labeling process at any point.
The second web page, which assigns new labels to unlabeled questions, is shown in Figure 3a; the corresponding sub-workflow, in the wrapped metanode called Labeling, is shown in Figure 3b.
Figure 3a. Web page generated by the Labeling wrapped metanode in order to manually insert new labels.
Figure 3b. The Labeling wrapped metanode. The wrapped metanode allows you to manually insert new labels. The Text Output nodes and the Table Editor (Java Script) node serve to generate the web page.
Note. The sub-workflow in Figure 3b, generating the page in Figure 3a, contains a Table Editor node. The Table Editor node allows editing of the content of the table cells.
The two workflows implementing this part of the active learning cycle are called “04_AL_Re-label_Uncertain_Classes” and “05_AL_Labeling” and are available on the EXAMPLES Server here: 50_Applications/33_Emil_the_TeacherBot*
The Dreadful Questions
Now we need to address the dreadful questions that come at the end of each project.
- How many iterations do you need to reach a reasonably labeled training set?
- How did class distribution in the dataset develop across iterations?
- And, finally, did it work? Did we actually reach a reasonably labeled training set to produce a model with sufficient performance?
How many active learning iterations do you need?
This question cannot be answered precisely. It depends too heavily on the data and on the classification problem. Usually, active learning iterations are run until no more changes can be observed in the class distribution on the training set. In this project, three iterations were sufficient to reach a stable class distribution on the training set. Which leads to our next question.
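The stopping rule — compare the class distributions of two consecutive iterations and stop when they no longer shift — could be sketched like this (the 1% tolerance is our illustrative choice, not a value from the project):

```python
from collections import Counter

def distribution_changed(prev_labels, new_labels, tol=0.01):
    """Return True if any class frequency shifted by more than `tol`
    (as a fraction of the dataset size) between two iterations."""
    n = len(new_labels)
    prev, new = Counter(prev_labels), Counter(new_labels)
    classes = set(prev) | set(new)
    return any(abs(prev[c] - new[c]) / n > tol for c in classes)
```

Once this check reports no change for a new iteration, the cycle can stop; in Emil’s case that happened after three iterations.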
How did the class distribution change across iterations?
At each iteration, class distribution on the training set was represented by a word cloud. Words are the class names. Word size is the number of points assigned to that class. You can see the word clouds for iteration 0 (starting point), 1, and 2 in Figure 4.
As a starting point, we assigned classes to questions on the basis of the smallest N-gram Tversky distance between the question and the tutorial pages. Since there are lots of pages on the KNIME site that describe how to install KNIME Analytics Platform, the class most frequently assigned to the forum questions was “Installation”. This is also the dominant word in the word cloud at iteration # 0.
After the first iteration (iteration # 1) of manual labeling and k Nearest Neighbor assignment, the most frequent class was “ETL”, which makes more sense. The majority of questions on the KNIME Forum refer to ETL, i.e. data manipulation operations.
The situation did not change much after iteration # 2. That is why we stopped the active learning cycle here.
Figure 4. Word cloud of class frequency in training set at iteration 0, 1, and 2 of the active learning cycle.
And, finally: Did it work?
It is hard to define a metric for success here. The usual sensitivity, specificity, accuracy, Cohen’s Kappa measures won’t work, as we are referring to a dataset that is potentially wrongly labeled. A 90% accuracy on a faulty test set only tells us that the model has learned well to generalize possibly faulty rules on unseen data.
Indeed, in our project, accuracy on the test set decreased as the active learning cycle progressed, which could be seen as very discouraging. Let’s walk through the active learning cycle using one single question as an example. The example question is:
The right class would be “Data Access” and the answer should lead to database tutorials.
After iteration 0, the top two proposed classes are “Installation” and “ETL”. In phase two, the tutorial pages on the KNIME site closest to the original question end up being “The Workflow Coach” and “The EXAMPLES Server”. These pages are not necessarily wrong answers, since the Workflow Coach and the EXAMPLES Server cover a bit of all topics; however, they are also not very specific.
After iteration 1, the proposed predicted class, i.e. the output class with highest probability, is “Data Access”. This leads us to the tutorial pages “Twitter meets PostgreSQL” and “Blending Databases. A Database Jam Session”. Both answers would definitely be more pertinent to the question than what is proposed in iteration 0.
After iteration 2, the proposed output class and final proposed tutorial pages have not changed.
While it is hard to quantify success in this case, it is easy to see that, at least for this single question, the final answer becomes more correct despite a decreasing accuracy. The same happened for a number of other questions we inspected.
Note. Accuracy is not the only measure of a successful project!
We have shown here how to train a supervised model with iteratively adjusted supervision via an active learning cycle. The idea behind active learning is that even if you do not have a reliably labeled dataset, you can still start from whatever dataset you have and iteratively, partially relabel it, improving the quality of the labels by extending them to the neighboring points.
Note. A common starting point for an active learning procedure consists of manually labeling a few random records from the most populated regions. This improves the quality of the dataset from iteration 0 already.
Using web pages on the KNIME WebPortal to confirm/reject labels and edit new ones has made human intervention within the cycle easier and more efficient.
This active learning procedure has allowed us to train Emil’s brain with supervision, even though a set of labeled questions was not available.
Hi, I am Emil and I can find out which tutorial pages best answer your question.
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)