We have covered several advanced techniques, such as Isolation Forest and DBSCAN, for detecting outliers in a dataset. Now, for the final technique in our fraud detection series, we will focus on a statistical approach to detecting fraudulent transactions: the distribution method.
The distribution method identifies outliers by analyzing how data points deviate from a known distribution, making it particularly effective in detecting anomalies like fraudulent credit card transactions, where most data follows predictable patterns.
Unlike Isolation Forest, which detects anomalies through random partitions, the distribution method uses statistical properties of the data to flag deviations from the norm. It is a straightforward and intuitive process, particularly when deployed in environments like KNIME Analytics Platform, which offers built-in support for implementing it.
This is part of a series of articles showing you solutions to common finance tasks related to financial planning, accounting, tax calculations, and auditing problems, all implemented with the low-code KNIME Analytics Platform.
With KNIME’s low-code, visual interface, you can create workflows that detect fraudulent transactions by identifying deviations from expected data patterns—without needing extensive pre-processing or external tools like Python.
In the following sections, we'll walk through how to apply the distribution method using KNIME Analytics Platform, highlighting the steps required to train and deploy the model for flagging outliers.
What is the distribution approach?
The distribution approach, like Quantiles, relies on statistical methods to detect outliers. Unlike our previous methods, which used more advanced techniques such as clustering or traditional machine learning, this method is simpler and more efficient at identifying outliers. It focuses on how data points align with or deviate from an assumed statistical distribution, typically a normal (Gaussian) distribution. The assumption is that most legitimate transactions follow a predictable pattern, while outliers (potentially fraudulent transactions) deviate from it.
This method works by analyzing the variance in the data and selecting transactions that fall outside expected ranges. Z-score normalization is commonly used: it transforms the data so that each feature has a mean of zero and a standard deviation of one. This makes it easier to compare features on different scales and to detect anomalies. Transactions with z-scores far from the mean are likely to be outliers and may indicate fraud.
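As a quick illustration, here is a minimal sketch of z-score normalization in Python; the transaction amounts are invented for the example:

```python
import numpy as np

def z_score(values: np.ndarray) -> np.ndarray:
    # Center on the mean and scale by the standard deviation, so the
    # result has a mean of zero and a standard deviation of one.
    return (values - values.mean()) / values.std()

amounts = np.array([12.5, 7.9, 15.0, 9.99, 4200.0])  # illustrative amounts
print(z_score(amounts))  # the extreme amount gets by far the largest |z|
```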
Confidence intervals help by setting thresholds for normal behavior. For example, a 95% confidence interval assumes that 95% of the data points fall within a certain range, and anything outside that range could be flagged as suspicious. The system can then label transactions as normal or anomalous depending on how much they deviate from expected statistical patterns.
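Under a normal assumption, a two-sided 95% interval corresponds to a z-score threshold of roughly 1.96. A minimal sketch of this thresholding, with invented z-scores for illustration:

```python
import numpy as np
from scipy.stats import norm

# Two-sided 95% confidence interval on a standard normal: |z| <= ~1.96
threshold = norm.ppf(0.975)

z_scores = np.array([0.3, -1.1, 0.7, -0.4, 2.8])  # illustrative z-scores
suspicious = np.abs(z_scores) > threshold
print(suspicious)  # only the transaction with z = 2.8 falls outside the band
```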
The distribution approach, when applied to fraud detection, enables us to flag transactions that deviate from expected patterns based on statistical distributions. It is particularly useful when working with highly imbalanced datasets, such as the Credit Card Fraud Detection dataset. By identifying transactions that lie outside the normal distribution range, we can flag potential fraud while keeping the false-positive rate low.
Identify fraudulent transactions with distributions
Credit card transactions can generally fall into two categories: normal and suspicious. The goal is to accurately identify fraudulent transactions while minimizing the number of false positives—cases where legitimate transactions are incorrectly flagged as fraudulent. Ideally, only a small percentage of flagged transactions should turn out to be false positives.
In our use case, we will focus on automating fraud detection by training a model on a labeled dataset and applying it to a new transaction to simulate incoming data from an outside data source.
For this, we will use the popular Credit Card Fraud Detection dataset from Kaggle. This dataset consists of real credit card transactions made by European cardholders in September 2013, with the transaction details anonymized for privacy. It includes a total of 284,807 transactions over a two-day period, of which 492 are fraudulent. The dataset exhibits a severe class imbalance between the ‘good’ (0) and ‘fraud’ (1) classes, with frauds accounting for only 0.172% of the data.
The dataset contains 31 columns:
- V1 - V28: Numerical input variables resulting from a PCA (Principal Component Analysis) transformation.
- Time: The number of seconds elapsed between each transaction and the first transaction in the dataset.
- Amount: The monetary value of the transaction.
- Class: This is the target variable, where ‘1’ indicates a fraudulent transaction and ‘0’ indicates a normal (non-fraudulent) transaction.
A key feature for training our model is the “Class” variable, as it lets us evaluate the performance of the algorithm on the dataset.
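If you want to inspect the raw data outside KNIME, a quick pandas check confirms the size and imbalance described above (assuming creditcard.csv, the file name used on Kaggle, as the local file):

```python
import pandas as pd

# Load the Kaggle dataset; "creditcard.csv" is assumed to be the local file name.
df = pd.read_csv("creditcard.csv")

print(df.shape)                                   # (284807, 31)
print(df["Class"].value_counts(normalize=True))   # fraud share ~0.00172
```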
The process for creating our classification model is outlined below (a minimal Python sketch of the same train-then-deploy split follows the list). Even when handling data from multiple sources, the core steps remain consistent:
- Create/import a labeled training dataset
- Z-score normalize the data
- Train the model
- Evaluate model performance
- Import the new, unseen transactions
- Deploy the model and feed the new transactions in
- Send a notification if any transactions are classified as fraudulent
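To make that split concrete: for a distribution method, the trained “model” is little more than the statistics of the training data. A minimal, hypothetical sketch (the 1.96 cutoff assumes the 95% interval discussed earlier):

```python
import numpy as np

def train(values: np.ndarray) -> dict:
    # The "model" is just the distribution statistics of the training data.
    return {"mean": float(values.mean()), "std": float(values.std())}

def deploy(model: dict, new_value: float, cutoff: float = 1.96) -> str:
    # Z-score the incoming transaction with the saved statistics and
    # flag it if it lands in the distribution tails.
    z = (new_value - model["mean"]) / model["std"]
    return "fraudulent" if abs(z) > cutoff else "normal"
```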
Using statistics to identify fraudulent transactions
All workflows used in this article are publicly available and free to download on the KNIME Community Hub. You can find the workflows on the KNIME for Finance space under Fraud Detection in the Distribution Based section.
The first workflow focuses on training our distribution model. You can view and download the training workflow, Distribution Training, from the KNIME Community Hub. With this workflow you can (a rough pandas equivalent follows the list):
- Read training data from a specified data source. In our case, we use data from the Kaggle dataset previously mentioned.
- Preprocess the data by renaming the values in the target column from 0 or 1 to “good” or “bad”
- Normalize the data using z-score normalization
- Keep only feature V5 using the ‘Column Filter’ node and remove the distribution tails
- Mark outliers, where “NOISE” indicates a potentially fraudulent transaction
- Evaluate model results by opening the view of the Scorer node to check the overall accuracy of the model
- Save the normalization (PMML) model to be used in the deployment workflow
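For readers who prefer to see the logic as code, here is a rough pandas equivalent of these steps; the choice of V5 mirrors the workflow, while the 1.96 tail cutoff and the file names are assumptions, not taken from the KNIME workflow:

```python
import json
import pandas as pd

df = pd.read_csv("creditcard.csv")
df["Class"] = df["Class"].map({0: "good", 1: "bad"})       # rename target values

# Z-score normalization of the single feature kept by the Column Filter
stats = {"mean": float(df["V5"].mean()), "std": float(df["V5"].std())}
df["V5_z"] = (df["V5"] - stats["mean"]) / stats["std"]

# Mark the distribution tails as outliers (potentially fraudulent)
df["prediction"] = df["V5_z"].abs().gt(1.96).map({True: "bad", False: "good"})

# Scorer-style evaluation: overall accuracy against the labeled data
print(f"accuracy: {(df['prediction'] == df['Class']).mean():.3f}")

# Save the normalization statistics, standing in for the PMML model
with open("normalization.json", "w") as f:
    json.dump(stats, f)
```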
In our second workflow, Distribution Deployment, you can perform the following steps:
- Read the Normalization PMML model and new data for classification
- Apply the normalization to the incoming transactions/new data using the PMML model
- Exclude distribution tails and outliers
- Mark outliers and classify the new transactions by assigning labels
- Send an email notification if a transaction is flagged as fraudulent
Inside the ‘Send Email’ component, we evaluate whether the transaction has been classified as fraudulent. If it is indeed flagged as fraudulent, an email is sent to the specified person for further follow-up.
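In plain Python, the deployment logic might look like the sketch below; the saved statistics file, the 1.96 threshold, and the addresses and SMTP host are all placeholders, not details of the KNIME workflow:

```python
import json
import smtplib
from email.message import EmailMessage

with open("normalization.json") as f:      # statistics saved during training
    stats = json.load(f)

new_v5 = -7.4                              # illustrative incoming transaction value
z = (new_v5 - stats["mean"]) / stats["std"]

if abs(z) > 1.96:                          # transaction falls in the tails
    msg = EmailMessage()
    msg["Subject"] = "Potentially fraudulent transaction flagged"
    msg["From"] = "alerts@example.com"
    msg["To"] = "analyst@example.com"
    msg.set_content(f"Transaction z-score {z:.2f} exceeded the threshold.")
    with smtplib.SMTP("smtp.example.com") as server:
        server.send_message(msg)
```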
A statistical method for classifying transactions
In our training workflow, opening up the ‘Scorer’ node gives us the confusion matrix below.
This confusion matrix summarizes the performance of the distribution method. While it doesn’t perform as well as some of our previous methods, such as Isolation Forest or DBSCAN, which achieved around 97-98% accuracy, it still performs impressively for a purely statistical method. The key comparison here is with our other statistical approach: Quantiles.
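As a reminder, the overall accuracy the Scorer reports can be read straight off a confusion matrix as the correctly classified transactions over the total:

```python
import numpy as np

def accuracy(confusion_matrix: np.ndarray) -> float:
    # Diagonal = correctly classified (true "good" + true "bad")
    return np.trace(confusion_matrix) / confusion_matrix.sum()
```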
Below, you will find another confusion matrix, this time from the Quantiles method.
Quantiles yielded around 84% accuracy, almost 10 percentage points lower than the distribution method’s roughly 94%. In fraud detection, this difference is significant, and it’s clear that the distribution method outperformed Quantiles in terms of accuracy.
One reason for this result is that Quantiles are more effective with data that is evenly spread across a range. Our dataset is the opposite: it is far from evenly distributed, with a large number of normal points and only a small number of fraudulent ones. In this context, the distribution method is better suited to handling such imbalanced data, which explains its higher accuracy.
KNIME for Finance
KNIME Analytics Platform provides a simple and intuitive setup for implementing statistical methods like the distribution approach for fraud detection. With KNIME, users can seamlessly integrate these techniques into their workflows using a low-code, visual interface. By leveraging built-in KNIME Extensions, even advanced statistical methods can be applied with minimal coding, allowing users to focus on solving real-world problems. In this article, we explored how the distribution method can be used to detect anomalies in financial datasets.
This is the last technique in our series on Credit Card Fraud Detection. Feel free to check out our previous methods covered in the KNIME Blog.