Deduplication of Address Data

This workflow demonstrates the new distance measurement framework: possible matches are predicted with high accuracy using a minimal number of nodes and no preprocessing, simply by aggregating distances over several attributes. The data set used is the "Restaurant data set" from http://www.cs.utexas.edu/users/ml/riddle/data.html, comprising 864 restaurant records and 112 duplicates. Each record contains a name, an address, a city, a type, and a class attribute.

01 Deduplication of Address data

An address data set containing duplicates is matched against a reference address data set via similarity search. Each row of the address data set is assigned the reference row with the minimum distance to it. The distance between two address rows is the mean of the 2-gram Dice distances of their 'name' and 'address' columns.
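
The matching rule described above can be sketched outside the workflow in a few lines of Python. This is only an illustration under stated assumptions, not the workflow's own implementation: the 2-gram Dice distance is taken to be 1 - 2|A ∩ B| / (|A| + |B|) over the sets of lowercased character bigrams of the two strings, and the function names (bigrams, dice_distance, row_distance, nearest_reference) as well as the sample rows are hypothetical. The distance nodes in the workflow may tokenize or normalize strings differently.

```python
def bigrams(text):
    """Return the set of lowercased character 2-grams of a string."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice_distance(a, b):
    """2-gram Dice distance: 1 - 2|A n B| / (|A| + |B|) on bigram sets."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        # Assumption: two strings without any bigrams count as identical.
        return 0.0
    return 1.0 - 2.0 * len(ga & gb) / (len(ga) + len(gb))


def row_distance(row, ref):
    """Mean of the Dice distances of the 'name' and 'address' columns."""
    name_d = dice_distance(row["name"], ref["name"])
    address_d = dice_distance(row["address"], ref["address"])
    return (name_d + address_d) / 2.0


def nearest_reference(row, reference_rows):
    """Return the reference row with the minimum distance to the given row."""
    return min(reference_rows, key=lambda ref: row_distance(row, ref))


# Made-up sample rows, for illustration only (not taken from the data set).
reference = [
    {"name": "Example Cafe", "address": "12 Main Street"},
    {"name": "Sample Diner", "address": "99 Ocean Avenue"},
]
query = {"name": "Exampel Cafe", "address": "12 Main St."}
match = nearest_reference(query, reference)
```

Because the rule assigns the closest reference row to every input row, a row without a true duplicate still receives a match. Whether to accept a match only below some distance threshold is a separate design choice that this sketch leaves open.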
