In today's blog post I want to explore some different approaches to dealing with missing values in data sets in the KNIME Analytics Platform. Missing data is a problem that most people have to deal with at some point, and there are several ways of handling it.
Data Enrichment, Visualization, Time Series Analysis, Optimization
There has been a lot of talk about the Internet of Things lately, especially since the purchase of Nest by Google, officially opening the run towards intelligent household systems.
Intelligent means controllable from a remote location and capable of learning the inhabitants’ habits and preferences. Companies working in this field have multiplied over the last few years and some of them have been acquired by bigger companies, like SmartThings by Samsung for example.
Deduplication is the process of identifying redundant records in a data set that refer to the same real-world entity and subsequently merging them. Address data sets often contain slightly different records that represent identical addresses or names. Names of persons, streets, or cities may be written differently, abbreviated, or misspelled. For example, consider the following two addresses:
- Muller Thomas, Karl-Heinz-Ring 3, 80686, Allach
- Mueller Tomas, Karl-Heinz-Ring 3, 80686, Munich Allach
To deduplicate address data sets, the records can be matched against a reference address data set in order to normalize their name and address notations.
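The matching step described above can be sketched outside of KNIME in a few lines of Python. This is only a rough illustration, not the workflow from the post: the reference data, the `similarity` and `match_to_reference` helpers, and the 0.8 threshold are all assumptions made for this example, and the similarity measure is the generic ratio from Python's standard `difflib` module rather than any KNIME node.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_to_reference(record, reference, threshold=0.8):
    """Return the most similar reference entry, or None if none reaches the threshold."""
    best = max(reference, key=lambda ref: similarity(record, ref))
    return best if similarity(record, best) >= threshold else None


# Hypothetical reference data set; the first entry is an assumed canonical notation.
reference = [
    "Mueller Thomas, Karl-Heinz-Ring 3, 80686, Munich Allach",
    "Schmidt Anna, Hauptstrasse 12, 10115, Berlin",
]

# The two example records from the text above.
records = [
    "Muller Thomas, Karl-Heinz-Ring 3, 80686, Allach",
    "Mueller Tomas, Karl-Heinz-Ring 3, 80686, Munich Allach",
]

# Map every record to its canonical notation; equal results indicate duplicates.
canonical = [match_to_reference(r, reference) for r in records]
```

Both example records map to the same reference entry and can therefore be merged. A real workflow would probably compare the name, street, and city fields separately instead of the whole line, and would need a threshold tuned to the data.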
With KNIME 2.10, new distance nodes have been released that allow the application of various distance measures in combination with the clustering nodes k-Medoids and Hierarchical Clustering, the Similarity Search node, and the Distance Matrix Pair Extractor node. Besides numerical distances such as p-norm distances (Euclidean, Manhattan, etc.) and cosine distance, string distances as well as byte and bit vector distances are provided. On top of that, distances can be aggregated. If you still can't find the distance function you are looking for, you can easily implement a customized distance with only one or two lines of Java code, using the Java Distance node. To get the nodes that make use of the new distances, install the "KNIME Distance Matrix" extension (KNIME & Extensions -> KNIME Distance Matrix).
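To make the numerical distances mentioned above concrete, here is a minimal Python sketch of the underlying formulas. It does not use or mirror the KNIME nodes or the Java Distance node API; the function names are made up for this example, and the weighted sum in `aggregated_distance` is just one assumed way of combining two distances.

```python
import math


def p_norm_distance(x, y, p=2.0):
    """Minkowski (p-norm) distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def cosine_distance(x, y):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def aggregated_distance(x, y, weights=(0.5, 0.5)):
    """Weighted sum of Euclidean and cosine distance (an assumed aggregation scheme)."""
    w_euclidean, w_cosine = weights
    return w_euclidean * p_norm_distance(x, y, 2.0) + w_cosine * cosine_distance(x, y)


# Two orthogonal example vectors.
x, y = [0.0, 3.0], [4.0, 0.0]

manhattan = p_norm_distance(x, y, p=1.0)
euclidean = p_norm_distance(x, y, p=2.0)
cosine = cosine_distance(x, y)
combined = aggregated_distance(x, y)
```

For these two vectors the Manhattan distance is 7, the Euclidean distance is 5, and the cosine distance is 1 because the vectors are orthogonal.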
Familiar from common applications such as demographic analysis, visualizing election results, and mapping outbreaks of disease, choropleths are an important visualization technique for aggregating spatial data. Today I want to discuss some techniques for generating such graphics in KNIME. So, to get us started, let's look at a very simple workflow to build such a visualization.
If you are using the KNIME Server with the KNIME WebPortal in your organization you can easily share published workflows with your colleagues by using the URL parameters of the WebPortal. Not only is this a great way of linking directly to a specific workflow, but it also gives you the possibility to embed the WebPortal somewhere else in your corporate environment.
This powerful feature was introduced with KNIME Server 3.7 and some enhancements were added with version 3.8.
The new KNIME Twitter nodes allow you to search for Tweets on Twitter, retrieve information about users, post Tweets via KNIME and much more.
Programming is fun. At least many aspects of programming. What's usually not considered fun is writing documentation and... testing. For the former I agree, but I will show you that the latter can also be fun. Even (or especially) with KNIME, and even more with some of the nice additions to the testing framework in 2.10.
KNIME has made a strong case for openness: deliver a complete platform that’s free and open source; then make money by adding functionality to make that platform easier and more efficient to use. But how does that work in reality? The just-released KNIME version 2.10, complemented with the next-generation commercial extensions to the KNIME Analytics Platform, gives us a taste of what they’re cooking up for us.