This example shows one way of anonymizing data. It uses the approved adults data set. For this example, a distance matrix is calculated for all relevant rows, then k-nearest neighbors is used to find the "closest" records (two by default) to each original record. A record to replace the original is then built by randomly choosing values from those closest neighbors. To test the anonymized data, a standard machine learning exercise is performed on the anonymized data, on the original data, and by applying the model trained on the anonymized data to the original data. Measures of quality are captured.
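The neighbor-based replacement step can be sketched in plain Python. This is not the KNIME implementation; the records, column layout, and `anonymize` helper are hypothetical stand-ins for numeric rows of the adults data set, assuming a simple Euclidean distance:

```python
import math
import random

# Hypothetical numeric records standing in for rows of the adults data set
# (e.g. age, fnlwgt, education-num).
records = [
    [39.0, 77516.0, 13.0],
    [50.0, 83311.0, 13.0],
    [38.0, 215646.0, 9.0],
    [53.0, 234721.0, 7.0],
    [28.0, 338409.0, 13.0],
]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def anonymize(records, k=2, seed=0):
    """Replace each record with one built from its k nearest neighbors."""
    rng = random.Random(seed)
    out = []
    for i, row in enumerate(records):
        # Distance from this row to every other row (the "distance matrix" row).
        dists = [(euclidean(row, other), j)
                 for j, other in enumerate(records) if j != i]
        neighbors = [records[j] for _, j in sorted(dists)[:k]]
        # Build the replacement by drawing each value from a random neighbor.
        out.append([rng.choice(neighbors)[col] for col in range(len(row))])
    return out

anon = anonymize(records)
```

Because every replacement value is drawn from a real neighbor, the anonymized table stays within the original value ranges while no row is an exact copy of the individual it replaces.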
This is not a functioning workflow but a conceptual one, designed to show how each of the GDPR example workflows can be used in a production workflow. It does demonstrate an interesting technique for explaining workflows, processes, and concepts to individuals who do not know KNIME. The majority of metanodes are "empty" and simply connect the input port to the output ports. The metanodes are documented to describe a major step or concept. By starting with a blank Table Creator node in the first metanode, you can actually "execute" the workflow.
This workflow consolidates all information and creates a BIRT report from the workflows: 0 GDPR Examples Overview, 1 Identify PII and Special Category Data, 2 Anonymize personal data, 3 Explain Model. Use your favorite tool or package to consolidate the information and make it understandable. You may have a version for: - The Legal Team - The Data Protection Officer - The individual whose data is being used. This is probably the most important workflow of all! We capture the decisions we have made and what we have done, but we EXPLAIN it as well.
For building, using, and testing GDPR metanodes, you will need to create data that actually exhibits the required conditions. In our case, we are interested in personal data, discriminatory fields, and pseudo-discriminatory fields. To create such data, we use the classic adults.csv dataset. Each row represents an individual who is anonymous, so we first add a unique identifier. There is already one "special category" field: race.
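Adding the unique identifier can be sketched as follows. The rows and column names here are illustrative, not the real adults.csv schema; the point is only that each otherwise-anonymous row gets a stable ID before any further processing:

```python
import uuid

# Hypothetical rows mimicking adults.csv; column names are illustrative.
rows = [
    {"age": 39, "workclass": "State-gov", "race": "White"},
    {"age": 50, "workclass": "Self-emp", "race": "Black"},
]

# Tag each otherwise-anonymous row with a unique identifier so that
# later steps can refer back to a specific individual record.
for row in rows:
    row["row_id"] = str(uuid.uuid4())
```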
This example shows one method for determining the relative "importance" of each feature based on the algorithm you have chosen for your model. Many other options and approaches are available depending on the machine learning technique you use. In this case, an example is shown using forward feature addition around a random forest, since that was the method chosen for the prediction problem. For further information, please refer to the white paper "Taking a proactive approach to GDPR with KNIME".
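The forward feature addition idea can be sketched generically. The feature names and the additive `score` function below are toy stand-ins; in the real workflow the score would be, for example, the cross-validated accuracy of a random forest trained on the candidate feature set:

```python
# Toy stand-in for model quality. In practice score(features) would train
# and evaluate a random forest on just those features; these weights are
# purely hypothetical.
TRUE_WEIGHT = {"age": 0.5, "education": 0.3, "hours": 0.15, "workclass": 0.05}

def score(features):
    # Higher is better; here simply the sum of the hypothetical weights.
    return sum(TRUE_WEIGHT[f] for f in features)

def forward_feature_addition(all_features, score_fn):
    """Greedily add the feature that most improves the score, recording the order."""
    selected, order = [], []
    remaining = list(all_features)
    while remaining:
        best = max(remaining, key=lambda f: score_fn(selected + [f]))
        selected.append(best)
        order.append((best, score_fn(selected)))
        remaining.remove(best)
    return order

ranking = forward_feature_addition(TRUE_WEIGHT, score)
```

The order in which features enter the model, and the score gain at each step, gives the relative importance ranking the example describes.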