Outlier Detection in Medical Claims

This workflow can be found on the KNIME EXAMPLES Server under 50_Applications/14_Medical_Claims/01_Interactive_Outlier_Detection*

The workflow demonstrates how to detect outliers in a publicly available set of medical claims using statistical measures. The techniques shown are not specific to medical data, however, and can be applied to other areas as well.


The goal of the workflow is to identify outliers such as claims with an unusually high cost for a specific disease. To find those outliers, we group the input data by the target variable (e.g. disease) and compute the mean and standard deviation of the numerical variable in question (e.g. cost of stay) within each group. Outliers are all records that deviate by more than x standard deviations from the mean value of the group they belong to. The factor x is specified by the analyst, e.g. x = 2.
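The rule described above can be sketched outside of KNIME in a few lines of Python. This is only an illustration, not the workflow's actual implementation: the function name `find_outliers`, the field names `disease` and `cost`, and the sample values are all hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_outliers(records, group_key, value_key, x=2.0):
    """Return records whose value deviates by more than x standard
    deviations from the mean of the group they belong to.

    Assumption: the sample standard deviation is used, so groups with
    fewer than two records are skipped (no deviation can be computed).
    """
    # Collect the numerical values of each group.
    values_by_group = defaultdict(list)
    for rec in records:
        values_by_group[rec[group_key]].append(rec[value_key])

    # Compute mean and standard deviation per group.
    stats = {
        group: (mean(values), stdev(values))
        for group, values in values_by_group.items()
        if len(values) >= 2
    }

    # Flag every record that lies outside mean +/- x * standard deviation.
    outliers = []
    for rec in records:
        if rec[group_key] not in stats:
            continue
        group_mean, group_std = stats[rec[group_key]]
        if abs(rec[value_key] - group_mean) > x * group_std:
            outliers.append(rec)
    return outliers


# Hypothetical sample data: nine ordinary claims and one very expensive one.
claims = [
    {"disease": "flu", "cost": c}
    for c in [100, 105, 95, 110, 90, 102, 98, 104, 96, 1000]
]
print(find_outliers(claims, "disease", "cost", x=2.0))
```

A larger factor x makes the test stricter, so fewer records are flagged. Note that a single extreme value also inflates the standard deviation of its own group, which is why very small groups rarely produce outliers under this rule.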


The upper branch of the workflow identifies outliers for one target variable as described above. The user can change the group and aggregation columns via the metanode context menu. The lower branch of the workflow refines this approach and allows outliers to be identified across several variables, e.g. claims with an unusually high cost for a certain disease and duration of stay. To achieve this, the user has to select two group columns, such as disease and duration of stay.
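The refinement in the lower branch amounts to grouping by a combination of columns instead of a single one. The following sketch shows the idea under the same assumptions as before: the function name `find_outliers_multi`, the field names `disease`, `stay` and `cost`, and the data are hypothetical, and the sample standard deviation is assumed.

```python
from collections import defaultdict
from statistics import mean, stdev


def find_outliers_multi(records, group_keys, value_key, x=2.0):
    """Return records deviating by more than x standard deviations from
    the mean of their group, where a group is defined by the combination
    of all columns listed in group_keys.
    """

    def group_of(rec):
        # The group identifier is the tuple of all group column values.
        return tuple(rec[key] for key in group_keys)

    values_by_group = defaultdict(list)
    for rec in records:
        values_by_group[group_of(rec)].append(rec[value_key])

    # Assumption: sample standard deviation, so groups need >= 2 records.
    stats = {
        group: (mean(values), stdev(values))
        for group, values in values_by_group.items()
        if len(values) >= 2
    }

    outliers = []
    for rec in records:
        group = group_of(rec)
        if group not in stats:
            continue
        group_mean, group_std = stats[group]
        if abs(rec[value_key] - group_mean) > x * group_std:
            outliers.append(rec)
    return outliers


# Hypothetical data: long stays are expensive in general, so their cost is
# only compared with other long stays for the same disease.
short_stays = [
    {"disease": "flu", "stay": "short", "cost": c}
    for c in [100, 105, 95, 110, 90, 102, 98, 104, 96, 400]
]
long_stays = [
    {"disease": "flu", "stay": "long", "cost": c}
    for c in [1000, 1050, 950, 1100, 900]
]
print(find_outliers_multi(short_stays + long_stays, ["disease", "stay"], "cost"))
```

Grouping by both columns means a claim is judged only against claims with the same disease and the same duration of stay, so a cost that is ordinary for a long stay is not flagged merely because it would be high for a short one.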

Data Set

The data set is the publicly available Basic Stand Alone (BSA) Inpatient Public Use File (PUF) named “CMS 2008 BSA Inpatient Claims PUF”. The file contains Medicare inpatient claims from 2008. Each record is an inpatient claim incurred by a 5% sample of Medicare beneficiaries. The file contains seven (7) variables: a primary claim key indexing the records and six (6) analytic variables. One of the analytic variables, claim cost, is provided in two forms, (a) as an integer category and (b) as a dollar average. These two versions are essentially equivalent. As they can be treated as one variable, there are six (6) rather than seven (7) analytic variables, in addition to the claim ID. There are some demographic and claim-related variables provided in this PUF. However, as beneficiary identities are not provided, it is not possible to link claims that belong to the same beneficiary in the CMS 2008 BSA Inpatient Claims PUF.

This workflow makes use of the following extensions:

  • KNIME Nodes to create KNIME Quick Forms (dark green nodes within the metanodes)
  • KNIME Professional extension: Linked Metanodes (optional)


* The link opens the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher).

