As first published in The New Stack.

Cohen’s kappa is a metric often used to assess the agreement between two raters. It can also be used to assess the performance of a classification model.

For example, if we had two bankers, and we asked both to classify 100 customers in two classes for credit rating, i.e. good and bad, based on their creditworthiness, we could then measure the level of their agreement through Cohen's kappa.

Similarly, in the context of a classification model, we could use Cohen’s kappa to compare the machine learning model predictions with the manually established credit ratings.

Like many other evaluation metrics, Cohen’s kappa is calculated based on the confusion matrix. However, in contrast to calculating overall accuracy, for example, Cohen’s kappa takes imbalance in class distribution into account and can therefore be more complex to interpret.

**In this article we will:**

- Guide you through the calculation and interpretation of Cohen’s Kappa values, particularly in comparison with overall accuracy values
- Show that where overall accuracy fails because of a large imbalance in the class distribution, Cohen’s kappa might supply a more objective description of the model performance
- Introduce a few tips to keep in mind when interpreting Cohen’s kappa values!

# Measuring Performance Improvement on Imbalanced Datasets

Let’s focus on a classification task on bank loans, using the German credit data provided by the UCI Machine Learning Repository. In this dataset, bank customers have been assigned either a “bad” credit rating (30%) or a “good” credit rating (70%) according to the criteria of the bank. For the purpose of this article, we exaggerated the imbalance in the target class credit rating via bootstrapping, giving us 10% with a “bad” credit rating and 90% with a “good” credit rating: a highly imbalanced dataset. Exaggerating the imbalance helps us to make the difference between “overall accuracy” and “Cohen’s kappa” clearer in this article.

Let’s partition the data into a training set (70%) and a test set (30%) using stratified sampling on the target column and then train a simple model, a decision tree, for example. Given the high imbalance between the two classes, the model will not perform too well. Nevertheless, let’s use its performance as the baseline for this study.

## Baseline model

In figure 1 you can see the confusion matrix and accuracy statistics for this baseline model. The overall accuracy of the model is quite high (87%) and hints at an acceptable performance by the model. However, in the confusion matrix, we can see that the model is able to classify only 9 out of the 30 credit customers with a bad credit rating correctly. This is also visible by the low sensitivity value of class “bad” - just 30%.

Basically, the decision tree is classifying most of the “good” customers correctly and neglecting the necessary performance on the few “bad” customers. The imbalance in the class a priori probability compensates for such sloppiness in classification. Let’s note for now that Cohen's kappa value is just 0.244, within its range of [-1,+1].

## Improved model

Let’s try to improve the model performance by forcing it to acknowledge the existence of the minority class. We train the same model this time on a training set where the minority class has been oversampled using the SMOTE technique, reaching a class proportion of 50 % for both classes.

To provide more detail about the confusion matrix for this model, 18 out of the 30 customers with a “bad” credit rating are detected by the model, leading to a new sensitivity value of 60% over the previous 30%. Cohen’s kappa statistics is now 0.452 for this model, which is a remarkable increase from the previous value 0.244. But what about overall accuracy? For this second model it’s 89%, not very different from the previous value 87%.

When summarizing we get two very different pictures. According to the overall accuracy, model performance hasn’t changed very much at all. However, according to Cohen’s kappa a lot has changed! Which statement is right?

From the numbers in the confusion matrix, it seems that Cohen’s kappa has a more realistic view of the model’s performance when using imbalanced data.

**Why does Cohen’s kappa take more notice of the minority class? How is it actually calculated? Let’s take a look!**

# Cohen’s kappa

Cohen’s kappa is calculated with the following formula [1]:

where *p*_{0 }is the overall accuracy of the model and *p _{e}* is the measure of the agreement between the model predictions and the actual class values as if happening by chance.

In a binary classification problem, like ours, *p _{e}* is the sum of

*p*the probability of the predictions agreeing with actual values of class 1 (“good”) by chance, and p

_{e1},_{e2}, the probability of the predictions agreeing with the actual values of class 2 (“bad”) by chance. Assuming that the two classifiers - model predictions and actual class values - are independent, these probabilities,

*p*and

_{e1}*p*, are calculated by multiplying the proportion of the actual class and the proportion of the predicted class.

_{e2}Considering “bad” as the positive class, the baseline model (Figure 1) assigned 9% of the records (false positives plus true positives) to class “bad”, and 91% of the records (true negatives plus false negatives) to class “good”. Thus p_{e} is:

And therefore Cohen’s kappa statistics:

which is the same value as reported in figure 1.

Practically, Cohen’s kappa removes the possibility of the classifier and a random guess agreeing and measures the number of predictions it makes that cannot be explained by a random guess. Furthermore, Cohen’s kappa tries to correct the evaluation bias by taking into account the correct classification by a random guess.

# Pain Points of Cohen’s Kappa

At this point, we know that Cohen’s kappa is a useful evaluation metric when dealing with imbalanced data. However, Cohen’s kappa has some downsides, too. Let’s have a look at them one by one.

## Full range [-1, +1], but not equally reachable

It’s easier to reach higher values of Cohen’s kappa, if the target class distribution is balanced.

For the baseline model (Figure 1), the distribution of the predicted classes follows closely the distribution of the target classes: 27 predicted as “bad” vs. 273 predicted as “good” and 30 being actually “bad” vs. 270 being actually “good”.

For the improved model (Figure 2), the difference between the two class distributions is greater: 40 predicted as “bad” vs. 260 predicted as “good” and 30 being actually “bad” vs. 270 being actually “good”.

As the formula for maximum Cohen’s kappa shows, the more the distributions of the predicted and actual target classes differ, the lower the maximum reachable Cohen’s kappa value is. The maximum Cohen’s kappa value represents the edge case of either the number of false negatives or false positives in the confusion matrix being zero, i.e. all customers with a good credit rating, or alternatively all customers with a bad credit rating, are predicted correctly.

where *p _{max}* is the maximum reachable overall accuracy of the model given the distributions of the target and predicted classes:

For the baseline model, we get the following value for *p _{max}*:

Whereas for the improved model it is:

The maximum value of Cohen’s kappa is then for the baseline model:

For the improved model it is:

As the results show, the improved model with a greater difference in the distributions between the actual and predicted target classes can only reach a Cohen's kappa value as high as 0.853. Whereas the baseline model can reach the value 0.942, despite the worse performance.

## Cohen’s kappa is higher for balanced data

When we calculate Cohen’s kappa, we strongly assume that the distributions of target and predicted classes are independent and that the target class doesn’t affect the probability of a correct prediction. In our example this would mean that a credit customer with a good credit rating has an equal chance of getting a correct prediction as a credit customer with a bad credit rating. However, since we know that our baseline model is biased towards the majority “good” class , this assumption is violated.

If this assumption were not violated, like in the improved model where the target classes are balanced, we could reach higher values of Cohen’s kappa. Why is this? We can rewrite the formula of Cohen’s kappa as the function of the probability of the positive class, and the function reaches its maximum when the probability of the positive class is 0.5 [1]. We test this by applying the same improved model (figure 2) to different test sets, where the proportion of the positive “bad” class varies between 5% and 95%. We create 100 different test sets per class distribution by bootstrapping the original test data, and calculate the average Cohen’s kappa value from the results.

Figure 3 shows the average Cohen’s kappa values against the positive class probabilities - and yes! Cohen’s kappa does reach its maximum when the model is applied to balanced data!

## Cohen’s kappa says little about the expected prediction accuracy

The numerator of Cohen’s kappa, p_{0}-p_{e} tells the difference between the observed overall accuracy of the model and the overall accuracy that can be obtained by chance. The denominator of the formula, 1-p_{e}, tells the maximum value for this difference.

For a good model, the observed difference and the maximum difference are close to each other, and Cohen’s kappa is close to 1. For a random model, the overall accuracy is all due to the random chance, the numerator is 0, and Cohen’s kappa is 0. Cohen’s kappa could also theoretically be negative. Then, the overall accuracy of the model would be even lower than what could have been obtained by a random guess.

Given the explanation above, Cohen’s kappa is not easy to interpret in terms of an expected accuracy, and it’s often not recommended to follow any verbal categories as interpretations. For example, if you have 100 customers and a model with an overall accuracy of 87 %, then you can expect to predict the credit rating correctly for 87 customers. Cohen’s kappa value 0.244 doesn’t provide you with an interpretation as easy as this.

# Summary

In this article we have explained how to use and interpret Cohen’s kappa to evaluate the performance of a classification model. While Cohen’s kappa can correct the bias of overall accuracy when dealing with unbalanced data, it has a few shortcomings. So the next time you take a look at the scoring metrics of your model, remember:

**Cohen’s kappa is more informative than overall accuracy**when working with unbalanced data. Keep this in mind when you compare or optimize classification models!- Take a look at the
**row and column totals**in the**confusion matrix**.**Are the distributions of the target/predicted classes similar?**If they’re not, the maximum reachable Cohen’s kappa value will be lower. - The
**same model**will give you**lower values of Cohen’s kappa for unbalanced**than for**balanced test data**. **Cohen’s kappa says little**about the**expected accuracy**of a**single prediction**

## Example workflow - Cohen's Kappa for Evaluating Classification Models

The workflow used for this study is shown in Figure 4. In the workflow we train, apply, and evaluate two decision tree models that predict the creditworthiness of credit customers. In the top branch, we train the baseline model, while in the bottom branch we train the model on the bootstrapped training set using the SMOTE technique.

# References

[1] Bland, Martin. "Cohen’s kappa." *University of York Department of Health Sciences **https://www-users.york.ac.uk/~mb55/msc/clinimet/week4/kappa_text.pdf**. [Accessed May 29 2020]* (2008).