Classification Metrics for Machine Learning
One of the biggest nightmares for an ML beginner is the range of seemingly complicated classification metrics (Confusion Matrix, Precision, Recall and so on) that more experienced practitioners use. As beginners, we normally tend to use Accuracy to decide which model performs better. But sometimes accuracy cannot be relied upon. Suppose two models have the same accuracy: which metric should we use next to decide which model is better? That’s what we are going to discuss in this blog.
Table of contents:
- Confusion Matrix
- Accuracy and why it cannot always be relied upon
- Precision and Recall
- Specificity and Sensitivity
- Multi-Class Classification
Let’s first start with the confusion matrix!
For the sake of simplicity, we will stick to 2-class classification for now, because 2-class classification metrics are more widely explored by experts and we can then extend the same ideas to multi-class problems. Let’s say there are two classes in our dataset: Positive and Negative. The Positive class is the class of interest to us and the Negative class is the one we are not interested in. For example, if we are predicting whether a person has Dengue or not, the Positive class is Has Dengue and the Negative class is Does not have Dengue. The Confusion Matrix of the classifier will look like this.
In the Rows, we have the Actual Class and in the Columns, we have the Predicted Class.
The four elements inside the matrix are very important:
- True Positive (TP): the number of predictions whose Actual Class is Positive and which the classifier correctly predicted as Positive.
- False Negative (FN): the number of predictions whose Actual Class is Positive but which the classifier incorrectly predicted as Negative.
- False Positive (FP): the number of predictions whose Actual Class is Negative but which the classifier incorrectly predicted as Positive.
- True Negative (TN): the number of predictions whose Actual Class is Negative and which the classifier correctly predicted as Negative.
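To make the four definitions concrete, here is a minimal sketch that counts them from a list of actual labels and a list of predicted labels. The function name `confusion_counts` and the toy labels are assumptions made up for illustration, not part of any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FN, FP, TN) for a 2-class problem."""
    tp = fn = fp = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1   # actually Positive, predicted Positive
        elif actual == positive:
            fn += 1   # actually Positive, predicted Negative
        elif predicted == positive:
            fp += 1   # actually Negative, predicted Positive
        else:
            tn += 1   # actually Negative, predicted Negative
    return tp, fn, fp, tn

# Hypothetical labels: 1 = Positive, 0 = Negative
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(tp, fn, fp, tn)
```

The four counts always add up to the number of predictions, which is a handy sanity check.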
If this was too much for you, take a look at the example below.
Let’s say we made a Cancer Detection Classifier and the Confusion Matrix looks like this.
Here, 10000 people were diagnosed by the classifier.
- 1000 people were correctly classified as having Cancer (True Positive)
- 8000 people were correctly classified as not having Cancer (True Negative)
- 700 people were incorrectly classified as not having Cancer (False Negative)
- 300 people were incorrectly classified as having Cancer (False Positive)
Now let’s take a look at how we can get the accuracy of the classifier from this confusion matrix.
If you look carefully in the confusion matrix, which ones are correctly classified by the classifier?
- True Positives and True Negatives are the ones which are correctly classified by the classifier.
- False Positives and False Negatives are the ones which are incorrectly classified by the classifier.
Therefore the accuracy can be formulated as:
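Written out in terms of the four confusion-matrix counts (a reconstruction from the definitions above, so treat the notation as an assumption), accuracy is the fraction of all predictions that were correct. The second line plugs in the numbers from the Cancer Detection Classifier example.

```latex
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

\text{Accuracy}_{\text{cancer}} = \frac{1000 + 8000}{1000 + 8000 + 300 + 700} = \frac{9000}{10000} = 0.90
```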
When can we not rely on Accuracy?
Let’s say we are building a Terrorist Detection Classifier for an Airport. In the Test dataset of 10000 people, only 10 people are Terrorists and the other 9990 are Not Terrorists.
Instead of building a complex classifier, we can simply predict that no one is a Terrorist. That gives us 9990 True Negatives, 10 False Negatives and no positive predictions at all. The confusion matrix looks like the above image (Figure 4). Can you guess the accuracy of this model?
The accuracy will be 99.9%. Wow! Without doing anything you reached 99.9%. But is this model really any good? It does not catch a single one of the 10 Terrorists.
We cannot rely on Accuracy when our dataset is heavily skewed. In this case we have only 10 people in one class and 9990 in the other. Never rely on Accuracy alone when the class imbalance is this significant.
In this Terrorist Detection example, the accuracy that you found (99.9%) is called the Null Accuracy: the accuracy you can achieve without learning anything, simply by always predicting the most frequent class.
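As a quick check, the Null Accuracy for this example can be computed directly from the class counts. This is a minimal sketch; the helper name `null_accuracy` and the string labels are assumptions chosen for illustration.

```python
from collections import Counter

def null_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Test set from the example: 10 Terrorists and 9990 Not Terrorists
y_true = ["terrorist"] * 10 + ["not_terrorist"] * 9990
baseline = null_accuracy(y_true)
print(f"Null accuracy: {baseline:.1%}")
```

Any real classifier for this problem should be judged against this baseline rather than against 0%.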
But what metric can we rely on when we cannot rely on Accuracy? That’s where Precision and Recall come in.
Let’s say we are building an Email Spam Classifier. There are two models that you trained and you need to find out which one is better. The confusion matrices of these models are as follows:
If you look carefully, the accuracies of both classifiers are the same (i.e., 80%).
To explain how one would choose between these two models, let’s explore the conditions where our model went wrong.
There are only two possibilities where our model went wrong:
- Email was Not Spam and was sent to Spam.
- Email was Spam but was not sent to Spam.
Which one of these possibilities is more critical?
Answer: When an Email was Not Spam but was sent to Spam.
Let’s say you had applied for a job and the interview invitation email was classified as Spam, so you never saw it and missed out on the job. On the other hand, if a Spam email lands in your inbox, there is comparatively little harm in it.
In this example, we will need a model which reduces the number of False Positives (Email was not Spam but was sent to Spam). Therefore, we will choose the second model because the number of False Positives in the first model is 30 whereas for the second model, it is 10.
This is where Precision comes in.
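In terms of the confusion-matrix counts, Precision is usually written as below. This is the standard definition, reconstructed here on the assumption that it matches the formula the post originally displayed.

```latex
\text{Precision} = \frac{TP}{TP + FP}
```

With the True Positives held fixed, fewer False Positives means a higher Precision, which is why it suits the Spam example.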
If we compare the precisions of the two Email Spam Classifiers, the Precision of the second model is better, so we consider the second one the better model. (More on Precision after Recall.)
Let’s say you trained two Cancer Detection Models and the confusion matrices are given to you. Which model would you choose?
As with our Email Spam classifiers, the accuracies here are the same (90%).
There are two possibilities where our model went wrong:
- Someone Has Cancer but was predicted to Not have Cancer
- Someone Does Not have Cancer but was predicted to Have Cancer.
Which of these two possibilities is more critical?
Answer: The patient Had Cancer but was not diagnosed as having Cancer.
In this example, we need to reduce the cases where the patient had Cancer but was not diagnosed as having it, i.e., we need to reduce the number of False Negatives. The number of False Negatives of the first model is 200 and that of the second model is 500. Therefore, we will choose the first model as our Cancer Detector (unless we have another model that performs better still).
This is where Recall comes in.
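In terms of the confusion-matrix counts, Recall is usually written as below. Again, this is the standard definition, given on the assumption that it matches the formula the post originally displayed.

```latex
\text{Recall} = \frac{TP}{TP + FN}
```

With the True Positives held fixed, fewer False Negatives means a higher Recall, which is what we want from a Cancer Detector.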
If we compare the recalls of the two models, the recall of the first Cancer Detector is greater, so we choose the first classifier.
Allow me to elaborate on Precision and Recall.
- Precision tells you, out of all the points that your classifier predicted to be positive, how many are really positive.
- Recall tells you, out of all the actual positive data points in your dataset, how many your classifier correctly identified.
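The two definitions translate into two one-line functions. The sketch below applies them to the counts from the Cancer Detection Classifier example earlier in this post (TP = 1000, FP = 300, FN = 700); the function names are assumptions for illustration, and the code assumes there is at least one positive prediction and at least one actual positive so that neither denominator is zero.

```python
def precision(tp, fp):
    """Of everything predicted Positive, the fraction that really is Positive."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of everything that really is Positive, the fraction the model found."""
    return tp / (tp + fn)

# Counts from the Cancer Detection Classifier example
tp, fp, fn = 1000, 300, 700
cancer_precision = precision(tp, fp)   # 1000 / 1300
cancer_recall = recall(tp, fn)         # 1000 / 1700
print(round(cancer_precision, 3), round(cancer_recall, 3))
```

So a classifier with 90% accuracy can still miss roughly four in ten of the actual Cancer cases.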
Precision and Recall come from Information Retrieval.
Let’s say we have a huge corpus of documents. I am trying to search for some documents using a query string. I want 10 documents which are of interest to me.
- Precision tells you, out of the 10 documents returned how many are truly relevant to the query.
- If originally 50 documents are relevant to the search query out of the whole corpus, Recall tells us how many of those 50 fall in the 10 documents returned as results.
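The search example can be sketched with plain Python sets. The document IDs below are hypothetical and chosen only so that 6 of the 10 returned documents happen to be relevant; nothing here comes from a real search system.

```python
# Hypothetical document IDs, chosen only for illustration
relevant = set(range(1, 51))                        # the 50 documents truly relevant to the query
returned = {3, 7, 12, 18, 25, 31, 60, 61, 62, 63}   # the 10 documents returned as results

hits = relevant & returned                 # relevant documents that were actually returned
precision_at_10 = len(hits) / len(returned)
recall_value = len(hits) / len(relevant)
print(len(hits), precision_at_10, recall_value)
```

Note that with only 10 results and 50 relevant documents, Recall can never exceed 10 / 50 = 0.2, however good the ranking is.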
Precision is kept high when we want our False Positives to be low (as in Email Spam Classifier). Recall is kept high when we want our False Negatives to be low (as in Cancer Detection). Therefore, it depends on the use-case that we are making the classifier for.
Now, if you try to increase your Precision, your Recall will usually fall. And if you try to increase your Recall, your Precision will usually suffer. Therefore, you will have to make a choice about the tradeoff. In the ideal case, we want both Precision and Recall to be 1, but that is rarely achievable in practical scenarios.
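One common way this tradeoff shows up is through the decision threshold of a classifier that outputs scores. The sketch below uses made-up scores and labels (an assumption, not data from this post) and shows Precision rising while Recall falls as the threshold is raised.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and Recall when every score >= threshold is predicted Positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    return tp / (tp + fp), tp / (tp + fn)

# Hypothetical classifier scores and true labels (1 = Positive, 0 = Negative)
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

for threshold in (0.25, 0.55, 0.75):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}")
```

On this toy data a low threshold catches every Positive at the cost of more False Positives, while a high threshold does the opposite.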
Is there a combined way to view Precision and Recall together?
F1 score is the combined measure of Precision and Recall. It is the Harmonic mean of the two.
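The harmonic mean of Precision P and Recall R is 2 * P * R / (P + R). Here is a minimal sketch of it; the function name and the two input pairs are assumptions picked to show how the harmonic mean punishes a lopsided classifier much harder than a simple average would.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; returns 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.5, 0.5)   # both metrics equal
lopsided = f1_score(1.0, 0.1)   # perfect precision, very poor recall
print(round(balanced, 3), round(lopsided, 3))
```

The simple average of 1.0 and 0.1 would be 0.55, but the F1 score stays close to the weaker of the two values.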
Now there are two more terms which are widely used in classification metrics: Sensitivity and Specificity.
Sensitivity and Specificity
Sensitivity is nothing but Recall. People use two different names for the same measure because they come from two different domains.
Specificity has a rather peculiar equation.
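In terms of the confusion-matrix counts, Specificity is usually defined as below, with Sensitivity shown next to it for comparison. These are the standard definitions, given on the assumption that they match the formula the post originally displayed.

```latex
\text{Specificity} = \frac{TN}{TN + FP}
\qquad
\text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN}
```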
The terms Specificity and Sensitivity originate from the medical domain.
- Specificity says: if a patient really does not have Cancer, how likely am I to diagnose them as not having Cancer.
- Sensitivity says: if a patient really has the disease, how likely am I to detect it.
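A useful way to remember the two is that Specificity is simply Recall computed for the Negative class. The sketch below shows this using the counts from the Cancer Detection Classifier example earlier in this post; the function names are assumptions for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of actual Positives that were detected (same as Recall)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of actual Negatives that were correctly cleared."""
    return tn / (tn + fp)

# Counts from the Cancer Detection Classifier example
tp, tn, fn, fp = 1000, 8000, 700, 300
sens = sensitivity(tp, fn)   # 1000 / 1700
spec = specificity(tn, fp)   # 8000 / 8300

# Treat Negative as the class of interest: its "true positives" are TN and its
# "false negatives" are FP, so the Recall formula reproduces Specificity.
negative_class_recall = sensitivity(tn, fp)
print(round(sens, 3), round(spec, 3), spec == negative_class_recall)
```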
These explanations might seem a little confusing and difficult to understand but they really depend on the use-case that you are applying and the scenario for which you are training your classifier.
Enough of 2-class Classification! How would one extend these ideas to the multi-class domain?
Below is a confusion matrix for Iris Flower Detection. There are three classes: Setosa, Versicolor and Virginica. Take a look at the confusion matrix and see if you can find out what it means.
The confusion matrix is pretty simple. Along the Rows we have the Actual classes. Along the Columns we have the Predicted classes. The Green boxes indicate that those were correctly classified and the Red boxes indicate the ones which were incorrectly classified.
Using the same principles as before we can find out the Precisions and Recalls.
For 2-class classification, when we calculated Precision and Recall, we were actually calculating them for the Positive class because it was the class of interest to us. We could equally compute Precision and Recall for the Negative class, but we did not bother because the Positive-class values already gave us an idea of how the model behaves on the Negative class. With multiple classes, however, the Precision and Recall of one class cannot tell us how the model performs on the others. That’s why we compute Precision and Recall separately for every class.
For the Iris Flower Detection, to find out how our model really performs, we need to find out the Precision and Recall for all of the three classes: Setosa, Versicolor and Virginica.
But how can we find the True Positive, False Positives, True Negatives and False Negatives for some class?
Take a look at the image below.
Stick with the class Setosa for now. The True Positives for the Setosa class are in the box where the model correctly classifies the flowers as Setosa. The False Positives for the Setosa class are the boxes marked in Red in the left confusion matrix. They are the flowers which the model predicted to be Setosa but which actually belong to some other class. The False Negatives for the Setosa class are the boxes marked in Red in the confusion matrix on the right-hand side of the image. They are the flowers which were actually Setosa but which our model predicted to be in some other class.
Please read the above paragraph again if you could not understand something.
Well, this was for the Setosa class. Let’s see the Versicolor class. This time I will not give any explanation as the ideas are the same.
Same ideas can be applied with the Virginica class too.
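In code, the rule is the same for every class: the True Positives sit on the diagonal, the False Positives are the rest of that class's column, and the False Negatives are the rest of that class's row. The sketch below is a minimal version of this. The matrix values are an assumption: they are one arrangement consistent with the per-class counts quoted later in this post, with the classes ordered Setosa, Versicolor, Virginica.

```python
classes = ["Setosa", "Versicolor", "Virginica"]

# Rows = actual class, columns = predicted class (assumed values, see above)
matrix = [
    [20, 20, 20],
    [5, 40, 5],
    [5, 20, 25],
]

def per_class_counts(matrix, k):
    """Return (TP, FP, FN) for the class at index k."""
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # rest of column k
    fn = sum(matrix[k]) - tp                  # rest of row k
    return tp, fp, fn

counts = {name: per_class_counts(matrix, k) for k, name in enumerate(classes)}
for name, (tp, fp, fn) in counts.items():
    print(f"{name}: TP={tp}, FP={fp}, FN={fn}")
```

True Negatives for a class are everything outside its row and column, but they are not needed for Precision or Recall.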
Now that we know how to find out the True Positives, False Positives and False Negatives for every class, let’s find out the Precision and Recalls for every class.
For the Setosa class:
True Positive = 20, False Positive = 5 + 5 = 10, False Negative = 20 + 20 = 40
Precision = 20 / (20 + 10) = 0.67
Recall = 20 / (20 + 40) = 0.33
For the Versicolor class:
True Positive = 40, False Positive = 20 + 20 = 40, False Negative = 5 + 5 = 10
Precision = 40 / (40 + 40) = 0.5
Recall = 40 / (40 + 10) = 0.8
For the Virginica class:
True Positive = 25, False Positive = 20 + 5 = 25, False Negative = 20 + 5 = 25
Precision = 25 / (25 + 25) = 0.5
Recall = 25 / (25 + 25) = 0.5
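Writing these down in a tabular format:
| Class | Precision | Recall |
| --- | --- | --- |
| Setosa | 0.67 | 0.33 |
| Versicolor | 0.50 | 0.80 |
| Virginica | 0.50 | 0.50 |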
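These figures can be recomputed from the per-class counts with a few lines of Python. In this sketch the (TP, FP, FN) triples are the ones worked out above; attaching the class names in the order Setosa, Versicolor, Virginica is an assumption based on the order in which the classes are discussed.

```python
# (TP, FP, FN) per class, taken from the worked numbers above
counts = {
    "Setosa": (20, 10, 40),
    "Versicolor": (40, 40, 10),
    "Virginica": (25, 25, 25),
}

rows = []
for name, (tp, fp, fn) in counts.items():
    class_precision = tp / (tp + fp)
    class_recall = tp / (tp + fn)
    rows.append((name, class_precision, class_recall))

print(f"{'Class':<12}{'Precision':>10}{'Recall':>8}")
for name, p, r in rows:
    print(f"{name:<12}{p:>10.2f}{r:>8.2f}")
```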
Now these numbers might not seem very interesting individually, but they can lead to very interesting results when compared with those of other classifiers (which comparison matters depends entirely on the use-case you are building the classifier for).
Thank you for sticking with me till here. If you did not understand something in this blog, do not worry. There are loads of materials on the Internet which can give you better intuitions.