This workflow can be found on the KNIME Workflow Public Server under 50_Applications/13_Address_Deduplication/01_Deduplication_of_Address_Data*
There is really no limit to people’s creativity when making typos! This is why string matching is such an important part of data analytics. Postal addresses, restaurant names, and customer references all rely on correct spelling. Since the spelling is often not correct, we need to turn to string matching algorithms (see Overview of Record Linkage and Current Research Directions for details) to find the string closest to what the user wrote. String matching algorithms are based on string distances between our current, possibly misspelled, data records and a reference data set.
For this use case we use the "Restaurant data set" which contains 864 restaurant records and 112 duplicates with typos in the restaurant name and/or address. Each record consists of name, address, city, type, and class. Records with identical class represent the same restaurant.
Let’s suppose two restaurants have very similar but not identical names, like “felidia” and “felida”. They probably refer to the same restaurant with just a minor typo. In this case, the absolute Levenshtein distance, calculated between the two names in the Similarity Search node, shows that the number of differing characters between the two spellings (here, 1) is small, so the two names might indeed refer to the same restaurant. A number of other string distances are available in the String Distance node, each with different features and suited to different problems.
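To make the “felidia” versus “felida” comparison concrete, here is a minimal Python sketch of the absolute Levenshtein distance. It is a plain textbook dynamic-programming version written for illustration, not the implementation used inside the KNIME nodes; the function name `levenshtein` is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("felidia", "felida"))  # 1
```

A distance of 1 means a single edit (here, dropping one “i”) separates the two spellings.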
Maybe checking the name is not enough. Maybe checking the name AND the address offers a more robust criterion. The String Distance node combined with the Aggregated Distance node calculates complex distances on one or more attributes, using one or more distance functions. The String Distance node calculates one simple distance on one single attribute. The Aggregated Distance node combines two simple distances. In the following workflow we built a distance as:
A cascade of such blocks can build arbitrarily complex distance functions for string comparison.
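The idea of combining per-attribute distances can be sketched in Python as follows. This is only an illustration under stated assumptions: the length-normalized Levenshtein distance, the weighted mean, the weights, and the example records are all our own choices, and they are not necessarily the combination configured in the workflow or the aggregation options offered by the node.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def aggregated_distance(record_a: dict, record_b: dict, weights: dict) -> float:
    """Weighted mean of the per-attribute distances (assumed aggregation rule)."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * normalized_distance(record_a[field], record_b[field])
        for field, weight in weights.items()
    )
    return weighted_sum / total_weight


# Made-up records for illustration only; the addresses are not taken from the data set.
first = {"name": "felidia", "address": "12 example st."}
second = {"name": "felida", "address": "12 example st"}

# Hypothetical weights: name and address count equally.
print(aggregated_distance(first, second, {"name": 1.0, "address": 1.0}))
```

Normalizing each distance before combining keeps a long attribute such as the address from dominating a short one such as the name.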
The problem of string matching (or address deduplication) is now reduced to the definition of a reference list of correct strings, in our case names and addresses.
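Once a reference list exists, matching a possibly misspelled entry comes down to a nearest-neighbor lookup. The sketch below assumes a small made-up reference list and a hypothetical cutoff `max_distance`; the function name `closest_reference` is ours and does not correspond to a KNIME node setting.

```python
from typing import Optional


def levenshtein(a: str, b: str) -> int:
    """Textbook Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def closest_reference(query: str, references: list, max_distance: int) -> Optional[str]:
    """Return the reference string nearest to query, or None when even the
    nearest one is farther away than max_distance."""
    if not references:
        return None
    best = min(references, key=lambda reference: levenshtein(query, reference))
    return best if levenshtein(query, best) <= max_distance else None


# Made-up reference list of correct spellings, for illustration only.
reference_names = ["felidia", "le bernardin", "union square cafe"]

print(closest_reference("felida", reference_names, max_distance=2))  # felidia
print(closest_reference("zzzzzz", reference_names, max_distance=2))  # None
```

The cutoff matters: without it, every input would be mapped to some reference entry, even when nothing in the list is a plausible match.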
This workflow makes use of the following extensions:
- KNIME Distance Matrix
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)