What is Cohen's kappa and when should I use it?

Cohen's kappa is a metric that measures agreement between two raters or classifiers, accounting for chance agreement. It's particularly useful when evaluating classification models on imbalanced datasets where overall accuracy can be misleading. Unlike accuracy, Cohen's kappa considers the probability of random agreement and provides a more objective assessment of model performance.

How is Cohen's kappa calculated?

Cohen's kappa is calculated using the formula: κ = (p₀ - pₑ) / (1 - pₑ), where p₀ is the overall accuracy and pₑ is the expected agreement by chance. For binary classification, pₑ is calculated by multiplying the proportions of actual and predicted classes for each category and summing them up.

What are the limitations of Cohen's kappa?

Cohen's kappa has several limitations: 1) It's easier to achieve higher values with balanced class distributions, 2) The maximum reachable value depends on the similarity between predicted and actual class distributions, 3) It doesn't provide intuitive interpretation in terms of expected prediction accuracy, and 4) The same model will yield lower kappa values on imbalanced test data compared to balanced data.

Why does Cohen's kappa perform better than accuracy on imbalanced datasets?

On imbalanced datasets, overall accuracy can be misleadingly high because a model can achieve high accuracy by simply predicting the majority class most of the time. Cohen's kappa corrects for this bias by accounting for chance agreement and considering the performance on both majority and minority classes, providing a more realistic assessment of model performance.

What does a Cohen's kappa value mean?

Cohen's kappa ranges from -1 to +1. A value of +1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance. For example, a kappa of 0.244 suggests modest agreement beyond chance, while 0.452 indicates noticeably stronger agreement. However, kappa values should be interpreted carefully, taking into account the class distributions and the context of the problem.

What Is Cohen’s Kappa? How and When to Use It (Plus Common Pitfalls to Avoid)

What is Cohen’s kappa?

Cohen’s kappa is a statistical metric used in machine learning to score the performance of a classification model. It quantifies the agreement between two classifiers on categorical labels while correcting for the agreement that could occur by chance.

Put simply, Cohen’s kappa measures the agreement between two raters assessing the same thing. Sometimes, raters agree just by luck, so what makes Cohen’s kappa special is that it takes chance into account and corrects for that. This makes it particularly valuable when your data is imbalanced and one class heavily outweighs the other.

In this article, you’ll learn:

When and why to use Cohen’s kappa, especially with imbalanced datasets
How to calculate Cohen’s kappa
Important pitfalls of Cohen’s kappa and how to avoid them
How model balancing techniques (e.g. SMOTE) influence Cohen’s kappa
How to evaluate models with the low-code tool, KNIME Analytics Platform

When to use Cohen’s kappa?

#1. When your data is imbalanced

In many real-world problems, such as fraud detection, credit scoring, and medical diagnosis, data can be imbalanced: one class (e.g. good credit) is much more frequent than the other (e.g. bad credit).

In these cases, overall accuracy becomes misleading because a model can be highly accurate simply by predicting the majority class.

How Cohen’s kappa helps: It adjusts for the agreement that could occur simply due to the majority class, providing a more realistic estimate of the model’s performance.
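As a rough illustration of this point, the sketch below scores a hypothetical classifier that always predicts the majority class. The data (90 “good” and 10 “bad” labels) is made up, and the helper names `accuracy` and `cohens_kappa` are assumptions for this example, not part of the workflow used in this article.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of records where the prediction matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def cohens_kappa(actual, predicted):
    """Cohen's kappa from two equally long label sequences."""
    n = len(actual)
    p_o = accuracy(actual, predicted)
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: sum over classes of (actual share * predicted share).
    p_e = sum((actual_counts[c] / n) * (predicted_counts[c] / n)
              for c in actual_counts)
    if p_e == 1.0:
        # Only one class on both sides: kappa is undefined, return 0 here.
        return 0.0
    return (p_o - p_e) / (1 - p_e)

# Made-up data: 90 "good" and 10 "bad" customers.
actual = ["good"] * 90 + ["bad"] * 10
# Hypothetical model that ignores its input and always answers "good".
always_good = ["good"] * 100

print(accuracy(actual, always_good))      # 0.9
print(cohens_kappa(actual, always_good))  # 0.0
```

The accuracy of 90% is exactly what chance agreement with the majority class would give, so κ is 0.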

Example: Imbalanced credit rating data

Using the German Credit Rating dataset, we artificially increased imbalance to 10% “bad” vs. 90% “good” credit ratings. A simple Decision Tree achieved:

Accuracy of 87%
Sensitivity (bad credit) of only 30%
Cohen’s kappa: 0.244

The model looks accurate, but performs poorly on the minority class. Cohen’s kappa adjusts for this imbalance.

#2. When comparing human labels to model predictions

When training models in a supervised learning setup, we often treat human-labeled data as ground truth. Cohen’s kappa helps evaluate:

The extent to which a model’s predictions agree with human labels
How consistent human raters are with each other
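To make the second use case concrete, here is a minimal sketch of Cohen’s kappa between two human raters. The two label lists are invented for illustration, and the function name `cohens_kappa` is an assumption of this example.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items.

    Assumes the raters use more than one category in total;
    otherwise 1 - p_e is zero and kappa is undefined.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both raters must label the same items")
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: share of items where both raters chose the same label.
    pair_counts = Counter(zip(labels_a, labels_b))
    p_o = sum(pair_counts[(c, c)] for c in categories) / n
    # Chance agreement from each rater's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Invented ratings of ten credit applications by two hypothetical raters.
rater_a = ["good", "good", "good", "bad", "bad",
           "good", "good", "bad", "good", "good"]
rater_b = ["good", "good", "bad", "bad", "bad",
           "good", "good", "good", "good", "good"]

print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.524
```

The raters agree on 8 of 10 items, but since both label 70% of the items “good”, an agreement of 58% would be expected by chance alone, which is why κ is well below 0.8.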

#3. When you need a chance-corrected agreement metric

Unlike accuracy, Cohen’s kappa incorporates:

Observed agreement
Chance agreement (based on the distribution of predicted and actual classes)

Therefore, Cohen’s kappa can help evaluate more robustly whether an observed agreement exceeds what would be expected by chance given the label distributions.

How to interpret Cohen’s kappa in practice

Cohen’s kappa on imbalanced data

In the imbalanced credit rating example:

The model correctly predicted most “good” customers.
It missed most “bad” customers.
Accuracy looked high, but Cohen’s kappa revealed the model’s inability to successfully predict the minority class.

To judge how well the model performs beyond accuracy alone, we look at its Cohen’s kappa value, which is 0.244 in this case. Cohen’s kappa ranges from -1 to +1, and the closer to +1, the better. A value of 0.244 is low, indicating considerable room for improvement in the model’s performance.

Because Cohen’s kappa penalizes chance alignment with the dominant class, it exposes situations where the model is “cheating” by predicting the majority class.

You can see this in the confusion matrix and class/overall statistics for the baseline model. This is a Decision Tree model trained on our highly imbalanced training set. Here you can see the high overall accuracy (87%), although the model detects just a few of the customers with a bad credit rating (sensitivity just at 30%).

Accuracy: 87%
Sensitivity (bad credit): only 30%
Cohen’s kappa: 0.244

Imbalanced data table — Fig. 1 Baseline model

Cohen’s kappa on balanced data

To improve the model’s performance, we can force it to acknowledge the existence of the minority class – the “bad” customers.

To do that, we train the same model on a training set where the minority class has been oversampled using the SMOTE technique.

This creates a more balanced dataset, with a 50/50 split between “good” and “bad” credit customers.

The improved model performs better on the minority class, correctly identifying 18 out of 30 customers with a “bad” credit rating. Cohen’s kappa for this improved model increases to 0.452. This is a significant improvement in agreement. Overall accuracy here is 89% – not so different from the previous value, 87%.

Accuracy: 89% (almost unchanged)
Cohen’s kappa: 0.452 (significant increase)
Sensitivity for “bad” credit: 60%

Balanced data table — Fig. 2. Improved model

This demonstrates that accuracy barely changed, but Cohen’s kappa nearly doubled. This reflects the model’s true improvement.
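The comparison can be re-traced with a short sketch. The confusion matrix counts below are not quoted directly in this article. They are reconstructed from the figures it reports (300 test records with 30 “bad” customers, 27 and 40 records predicted as “bad”, and sensitivities of 30% and 60%), so treat them as an assumption. The function name `summarize` is likewise hypothetical.

```python
def summarize(tp, fn, fp, tn):
    """Accuracy, sensitivity and Cohen's kappa for a 2x2 confusion matrix.

    "bad" is treated as the positive class.
    """
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    actual_pos = (tp + fn) / n
    predicted_pos = (tp + fp) / n
    # Chance agreement for the positive plus the negative class.
    p_e = actual_pos * predicted_pos + (1 - actual_pos) * (1 - predicted_pos)
    kappa = (accuracy - p_e) / (1 - p_e)
    return accuracy, sensitivity, kappa

# Counts reconstructed from the reported figures (an assumption, see above).
baseline = summarize(tp=9, fn=21, fp=18, tn=252)
improved = summarize(tp=18, fn=12, fp=22, tn=248)

for name, (acc, sens, kappa) in [("baseline", baseline), ("improved", improved)]:
    print(f"{name}: accuracy={acc:.2f} sensitivity={sens:.2f} kappa={kappa:.3f}")
# baseline: accuracy=0.87 sensitivity=0.30 kappa=0.244
# improved: accuracy=0.89 sensitivity=0.60 kappa=0.452
```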

How to calculate Cohen’s kappa

Cohen’s kappa is defined as:

κ = \frac{p_{0} - p_{e}}{1 - p_{e}}

Where:

p₀ = the observed agreement (“o” for observed), a.k.a. the overall accuracy, of the model
pₑ = the expected agreement (“e” for expected) of the model, i.e., the agreement between the model predictions and the actual class values as if happening by chance

In a binary classification problem, like ours, pₑ is the sum of pₑ₁, the probability of the predictions agreeing with actual values of class 1 (“good”) by chance, and pₑ₂, the probability of the predictions agreeing with the actual values of class 2 (“bad”) by chance.

Assuming that the two classifiers (model predictions and actual class values) are independent, these probabilities, pₑ₁ and pₑ₂, are calculated by multiplying the proportion of the actual class and the proportion of the predicted class.

If we consider “bad” as the positive class, the baseline model (fig. 1) assigned 9% of the records (false positives plus true positives) to class “bad”, and 91% of the records (true negatives plus false negatives) to class “good”. Thus pe is:

p_{e} = p_{e1} + p_{e2} = p_{1,\text{target}} \cdot p_{1,\text{pred}} + p_{2,\text{target}} \cdot p_{2,\text{pred}} = 0.90 \cdot 0.91 + 0.10 \cdot 0.09 = 0.828

Therefore, the Cohen’s kappa statistic is:

κ = \frac{0.870 - 0.828}{1 - 0.828} \approx 0.244

This is the same value you can see in the baseline model.
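The same calculation can be sketched for a confusion matrix of any size, since the chance agreement is simply the sum over classes of actual share times predicted share. The matrix below is reconstructed from the percentages reported for the baseline model (an assumption, as the counts themselves are not quoted in the text), and `kappa_from_confusion_matrix` is a hypothetical helper name.

```python
def kappa_from_confusion_matrix(matrix):
    """Return (p_o, p_e, kappa) for a square confusion matrix.

    matrix[i][j] is the number of records with actual class i
    and predicted class j.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Observed agreement: the diagonal holds the correct predictions.
    p_o = sum(matrix[i][i] for i in range(k)) / n
    # Share of each class among actual labels (rows) and predictions (columns).
    actual_share = [sum(matrix[i]) / n for i in range(k)]
    predicted_share = [sum(matrix[i][j] for i in range(k)) / n for j in range(k)]
    p_e = sum(a * p for a, p in zip(actual_share, predicted_share))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)

# Rows: actual "good", actual "bad". Columns: predicted "good", predicted "bad".
baseline = [[252, 18],
            [21, 9]]

p_o, p_e, kappa = kappa_from_confusion_matrix(baseline)
print(round(p_o, 3), round(p_e, 3), round(kappa, 3))  # 0.87 0.828 0.244
```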

In simple terms, Cohen’s kappa helps us evaluate a classifier by considering how often it makes predictions that go beyond random chance.

Pitfalls of Cohen’s Kappa (and How To Avoid Them)

Cohen’s kappa is powerful, but it is not perfect. Below are the key pitfalls.

Pitfall #1. Cohen’s kappa’s range is theoretically [-1, 1], but not equally reachable

Why it happens

The maximum achievable Cohen’s kappa depends on how closely the predicted and actual class distributions match. When the model and the true labels assign classes in similar proportions, higher Cohen’s kappa values are easier to reach.

If predicted and true class proportions differ significantly, the maximum achievable observed agreement drops below 1, because the two sets of labels can no longer match on every record. This lowers the maximum attainable κ.
As a result, two models with identical accuracy can have different κ values simply because their predicted class proportions differ — making direct comparisons misleading.

How to avoid this pitfall

Compare models on the same dataset, not across datasets with different class ratios.
Inspect row/column totals in the confusion matrix.
Consider processing the dataset carefully so class proportions are consistent when comparing models (e.g., balancing, stratified sampling).

See how this plays out in an example

In our baseline model (see the first table above), the distribution of the predicted classes closely follows the distribution of the target classes:

27 predicted as “bad” vs. 273 predicted as “good”
30 being actually “bad” vs. 270 being actually “good”.

For the improved model (see the second table above), the difference between the two class distributions is greater:

40 predicted as “bad” vs. 260 predicted as “good”
30 being actually “bad” vs. 270 being actually “good”.

As the formula for maximum Cohen’s kappa shows, the more the distributions of the predicted and actual target classes differ, the lower the maximum reachable Cohen’s kappa value is.

The maximum Cohen’s kappa value represents the edge case of either the number of false negatives or false positives in the confusion matrix being zero, i.e. all customers with a good credit rating, or alternatively all customers with a bad credit rating, are predicted correctly.

κ_{\max} = \frac{p_{\max} - p_{e}}{1 - p_{e}}

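where p_max is the maximum reachable overall accuracy of the model given the distributions of the target and predicted classes, i.e., for each class, the smaller of its share among the actual labels and its share among the predicted labels:

p_{\max} = \min (p_{1,\text{target}}, p_{1,\text{pred}}) + \min (p_{2,\text{target}}, p_{2,\text{pred}})

For the baseline model, we get the following value for p_max: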

p_{\max} = \min (0.10, 0.09) + \min (0.90, 0.91) = 0.99

Whereas for the improved model it is:

p_{\max} = \min (0.10, 0.13) + \min (0.90, 0.87) = 0.97

The maximum value of Cohen’s kappa is then for the baseline model:

κ_{\max} = \frac{p_{\max} - p_{e}}{1 - p_{e}} = \frac{0.99 - 0.828}{1 - 0.828} = 0.942

For the improved model it is:

κ_{\max} = \frac{p_{\max} - p_{e}}{1 - p_{e}} = \frac{0.97 - 0.796}{1 - 0.796} = 0.853

As the results show, the improved model, whose predicted class distribution differs more from the actual one, can only reach a Cohen’s kappa value as high as 0.853, whereas the baseline model can reach 0.942 despite its worse performance.
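The two maximum values can be reproduced with a few lines of code. This is a minimal sketch that takes the rounded class shares used in the text as input. The function name `kappa_max` and the ordering of the shares as (“bad”, “good”) are choices made for this example.

```python
def kappa_max(target_shares, predicted_shares):
    """Return (p_max, p_e, kappa_max) from per-class shares.

    Both arguments list the share of each class in the same order,
    once for the actual labels and once for the predictions.
    """
    # Chance agreement: actual share times predicted share, summed over classes.
    p_e = sum(t * p for t, p in zip(target_shares, predicted_shares))
    # Best possible accuracy: per class, at most min(actual, predicted) can match.
    p_max = sum(min(t, p) for t, p in zip(target_shares, predicted_shares))
    return p_max, p_e, (p_max - p_e) / (1 - p_e)

# Shares in the order ("bad", "good"), rounded as in the text.
target = (0.10, 0.90)
baseline = kappa_max(target, (0.09, 0.91))
improved = kappa_max(target, (0.13, 0.87))

print([round(v, 3) for v in baseline])  # [0.99, 0.828, 0.942]
print([round(v, 3) for v in improved])  # [0.97, 0.796, 0.853]
```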

Pitfall #2. Cohen’s kappa is higher for balanced data (even with the same model)

Why it happens

Kappa is maximized when the positive class probability is 0.5. That means that even a good model will get:

Lower Cohen’s kappa on imbalanced test sets
Higher Cohen’s kappa on balanced test sets

How to avoid this pitfall

Always report the class distribution next to κ.
When comparing models, test on datasets with consistent class ratios.
Use balanced test sets during model development when possible.

See how this plays out in an example

In experiments with multiple bootstrapped test sets, κ peaked when class proportions were equal — even though the model itself never changed.

When we calculate Cohen’s kappa, we strongly assume that the distributions of target and predicted classes are independent and that the target class doesn’t affect the probability of a correct prediction.

In our example this would mean that a credit customer with a good credit rating has an equal chance of getting a correct prediction as a credit customer with a bad credit rating.

However, since we know that our baseline model is biased towards the majority “good” class, this assumption is violated.

If this assumption were not violated, as in the improved model where the target classes are balanced, we could reach higher values of Cohen’s kappa.

Why is this? We can rewrite the formula of Cohen’s kappa as the function of the probability of the positive class, and the function reaches its maximum when the probability of the positive class is 0.5.

We test this by:

applying the same improved model to different test sets, where the proportion of the positive “bad” class varies between 5% and 95%
creating 100 different test sets per class distribution by bootstrapping the original test data, and
calculating the average Cohen’s kappa value from the results

The graph below shows the average Cohen’s kappa values against the positive class probabilities – and yes! Cohen’s kappa does reach its maximum when the model is applied to the balanced data!

Class distributions graph — Cohen’s kappa values (on the y-axis) obtained for the same model with varying positive class probabilities in the test data (on the x-axis). The Cohen’s kappa values on the y-axis are calculated as averages of all Cohen’s kappas obtained via bootstrapping the original test set 100 times for a fixed class distribution. The model is the Decision Tree model trained on balanced data, introduced at the beginning of the article
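The dependence on the class distribution can also be explored without bootstrapping. The sketch below is a simplified stand-in for the experiment, not a reproduction of it: it assumes the model keeps a fixed sensitivity and specificity whatever the test set looks like, and computes the expected κ analytically. The rates are taken from a confusion matrix reconstructed from the figures reported for the improved model (18 of 30 “bad” and 248 of 270 “good” customers predicted correctly), which is itself an assumption. The function name `kappa_at_prevalence` is hypothetical.

```python
def kappa_at_prevalence(prevalence, sensitivity, specificity):
    """Expected Cohen's kappa of a model with fixed sensitivity and
    specificity on data with the given share of positive records."""
    # Observed agreement: correct positives plus correct negatives.
    p_o = prevalence * sensitivity + (1 - prevalence) * specificity
    # Share of records the model labels as positive.
    predicted_pos = (prevalence * sensitivity
                     + (1 - prevalence) * (1 - specificity))
    # Chance agreement for the positive plus the negative class.
    p_e = (prevalence * predicted_pos
           + (1 - prevalence) * (1 - predicted_pos))
    return (p_o - p_e) / (1 - p_e)

# Rates from the reconstructed confusion matrix (an assumption, see above).
sensitivity, specificity = 18 / 30, 248 / 270

for share in (0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95):
    kappa = kappa_at_prevalence(share, sensitivity, specificity)
    print(f"positive share {share:.2f}: kappa = {kappa:.3f}")
```

At a positive share of 10% the function returns about 0.452, matching the value reported for the improved model, and κ is clearly lower at the 5% and 95% extremes than at 50%. In this simplified setting the exact location of the peak depends on the assumed rates (with equal sensitivity and specificity the curve is symmetric around 0.5), so its shape can differ from the bootstrapped curve.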

Pitfall #3. Cohen’s kappa says little about the expected prediction accuracy

Why it happens

κ measures agreement beyond chance — not the number of correct predictions.

For example:

A model with accuracy = 87% might have κ = 0.244, but that tells you nothing intuitive about “how many predictions will be right.”

How to avoid this pitfall

Pair Cohen’s kappa with additional metrics:

Accuracy
Precision/Recall
F1 Score
Class-specific sensitivity

Use κ to supplement—not replace—other performance indicators

See how this plays out in an example

The numerator of Cohen’s kappa, p₀ − pₑ, is the difference between the observed overall accuracy of the model and the overall accuracy that can be obtained by chance. The denominator of the formula, 1 − pₑ, is the maximum value of this difference.

For a good model, the observed difference and the maximum difference are close to each other, and Cohen’s kappa is close to 1.
For a random model, the overall accuracy is all due to the random chance, the numerator is 0, and Cohen’s kappa is 0.
Cohen’s kappa could also theoretically be negative. Then, the overall accuracy of the model would be even lower than what could have been obtained by a random guess.
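The three cases can be checked with made-up confusion matrices on 300 records (30 “bad”, 270 “good”). The counts are invented to hit each case exactly, and `cohens_kappa` is a hypothetical helper for this sketch.

```python
def cohens_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix ("bad" = positive class)."""
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n
    actual_pos = (tp + fn) / n
    predicted_pos = (tp + fp) / n
    p_e = actual_pos * predicted_pos + (1 - actual_pos) * (1 - predicted_pos)
    return (p_o - p_e) / (1 - p_e)

# Good model: every record is predicted correctly.
perfect = cohens_kappa(tp=30, fn=0, fp=0, tn=270)

# Random model: predicts "bad" for 10% of the records regardless of the
# actual class, so it is right for 10% of the "bad" and 90% of the "good".
chance_level = cohens_kappa(tp=3, fn=27, fp=27, tn=243)

# Worse than chance: every single prediction is wrong.
inverted = cohens_kappa(tp=0, fn=30, fp=270, tn=0)

print(perfect, chance_level, inverted)
```

The three values come out at 1, approximately 0 (up to floating-point rounding), and roughly -0.22. Note that the chance-level model still has an overall accuracy of 82%, which is exactly the part that the numerator subtracts.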

Given the explanation above, Cohen’s kappa is not easy to interpret in terms of an expected accuracy, and it is often not recommended to rely on verbal categories when interpreting it.

For example, if you have 100 customers and a model with an overall accuracy of 87%, then you can expect to predict the credit rating correctly for 87 customers. Cohen’s kappa value 0.244 doesn’t provide you with an interpretation as easy as this.

Example KNIME Workflow: Cohen’s kappa for evaluating classification models

The workflow we used in this article was built with the open source KNIME Analytics Platform. KNIME Analytics Platform makes experimentation easy with its low-code, no-code interface — ideal for both beginners and advanced data scientists.


Explore the workflow to use Cohen’s Kappa for Evaluating Classification Models.

The workflow:

Trains a baseline Decision Tree on imbalanced credit data
Trains an improved version using SMOTE-balanced data
Evaluates both models with accuracy and Cohen’s kappa
Compares performance side by side

In the workflow we train, apply, and evaluate two Decision Tree models that predict the creditworthiness of credit customers.

In the top branch, we train the baseline model
In the bottom branch we train the model on the training set where the minority class has been oversampled using the SMOTE technique.

