Outlier Techniques for Credit Card Fraud Detection

Based on an August 2020 report by Interpol, more people have been spending time online since the start of the coronavirus pandemic, which has resulted in increased cybercrime. UK Finance also claimed remote banking fraud losses soared by 21% to reach 80 million pounds during the first half of 2020. Online fraud is on the rise.

ING reported the following in their 2019 Annual Report: “Technology can help us deal with these risks by improving customer due diligence processes and the prevention, detection, quality and speed of response to financial economic crime. For example, we developed a virtual alert handler that uses artificial intelligence to better detect suspicious transactions and customer behaviors, … An AI-based anomaly detection tool went live in September, which is used to uncover suspicious transactions in the clearing and settlement process between banks.”

Criminals are persistently finding new ways to circumvent measures put in place to prevent fraud, making their activities difficult to detect. Thus, it is difficult for advanced machine learning models to be trained from available data because fraudulent patterns are not only scarce but also change rapidly. Below, we explore how data science can be used to identify credit card fraud.

Our case study draws on the Credit Card Fraud Detection Kaggle dataset which contains more than 284,000 credit card transactions performed by EU cardholders in September 2013. Each transaction is described by 30 features: 28 principal components pulled from the original data, the transaction amount, and the date. The 28 principal components hide sensitive cardholder data. Each transaction is assigned a class label: legitimate (0) and fraudulent (1).

Most of the transactions in the dataset are legitimate; only a very small portion — 492, or 0.2% — are fraudulent. This disproportion does not allow us to train supervised machine learning models successfully, given the limited number of available examples for the fraud class. If we consider that most datasets with credit card transactions are unlabeled or that fraudulent transactions cannot be reliably manually identified, the chances of applying a supervised machine learning model successfully dwindle further. This is where outlier detection techniques can be useful. Here, we experiment with several different outlier detection techniques.

Quantile-based: Box plot
Distribution-based: Z-score
Cluster-based: DBSCAN
Neural autoencoder
Isolation forest

Of the 492 fraudulent transactions, we used 80 in a validation set to optimize the parameters involved in the techniques, such as thresholds, and 20 as part of a test set. A corresponding number of the more numerous legitimate transactions was added to both validation and test sets. Performances, in terms of Recall and Precision on the test set, as shown below in the graphic.

Fig. 1. Recall and Precision measured on the test set for the outlier detection techniques described above.

The number of false positives is incredibly high for the first two techniques, box plot and z-score, as seen from their Precision percentage. This means that to detect some 60% of fraudulent actions, most transactions are labeled as fraud alarms. Not very useful!

Download the workflow, Overview of Credit Card Fraud Detection Techniques, from the KNIME Hub. It demonstrates different fraud detection techniques and evaluates the performance of each one on the same test set, reported in terms of Recall and Precision.

This workflow is stored in the Finance, Accounting, and Audit space on the KNIME Hub, which contains many more solution starters for you to get an idea about the capabilities KNIME provides in these fields.