I am trying to learn KNIME and do a text classifier. I am going through the Document Classification example and using that to build my own. I have a CSV file for my data that has one column for text and another for the category.
The example uses Table Reader nodes to read in data. I took the concatenated result of that model and exported it to a CSV so that it mimics my data. Everything is built the same, and I get identical results up to the Term Filtering node. Basically, instead of using the Table Reader I read the exact same data via File Reader. However, when I get to the Bag of Words part of the term filtering, I suddenly get different results: the example model produces many more rows of output. Is there something I'm not accounting for? All of the settings are identical, but the results are vastly different. Is the Strings To Document node tokenizing the CSV entries differently? I just used the default tokenizer.
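To see why tokenization alone can change the row count, note that a Bag of Words table has one row per (document, term) pair. Here is a minimal plain-Python sketch (not KNIME, and not the default OpenNLP tokenizer; the two tokenizer functions are just illustrative assumptions) showing how two tokenizers applied to identical text yield different row counts:

```python
# Illustrative sketch: a bag-of-words output has one row per (document, term)
# pair, so a different tokenization of the SAME text changes the row count.
import re

docs = [
    "KNIME's File Reader, by default, parses CSV rows.",
    "Tokenizer choice changes term counts!",
]

def whitespace_tokens(text):
    # Split on whitespace only; punctuation stays attached to the words.
    return text.split()

def word_tokens(text):
    # Keep runs of word characters; punctuation is dropped entirely.
    return re.findall(r"\w+", text)

# Unique (document index, term) pairs under each tokenizer.
ws_rows = {(i, t) for i, d in enumerate(docs) for t in whitespace_tokens(d)}
wd_rows = {(i, t) for i, d in enumerate(docs) for t in word_tokens(d)}

print(len(ws_rows), len(wd_rows))  # different counts from identical input
```

So if the text reaching Strings To Document differs even slightly (encoding, quoting, stripped punctuation), the bag-of-words size can diverge a lot.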
This is really difficult to diagnose without actually looking at your workflow.
Here is my workflow. Maybe I'm doing it wrong. I'm just learning this software, but this looked fairly straightforward.
As an example, when I run up to the Bag of Words Creator, the example workflow's output table has 25,009 rows. Using the exact same settings with my CSV import, I only get 1,726.
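A gap that large often means the rows themselves differ before tokenization ever happens, not just the tokens. One common culprit is text cells that contain embedded line breaks: a CSV-aware parser treats the quoted cell as one record, while a naive line-based read splits it into several. This is only a hypothesis about your data, but the effect is easy to demonstrate in plain Python:

```python
# Hypothetical illustration: a text cell with an embedded newline.
# A CSV-aware reader sees one record; a naive line split sees two,
# so the row count (and everything downstream) diverges.
import csv
import io

raw = 'text,category\n"first line\nsecond line",news\n'

csv_rows = list(csv.reader(io.StringIO(raw)))  # respects quoting
naive_rows = raw.strip().split("\n")           # breaks on every newline

print(len(csv_rows))    # header + one record
print(len(naive_rows))  # the quoted newline splits the record in two
```

It may be worth checking the row count of your File Reader output against the Table Reader output before the text-processing nodes, and comparing the File Reader's quote/escape settings against how the CSV was exported.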