Which performance metrics do we use in the Studio?

Summary

By looking at a model's performance metrics, we can understand it more deeply, compare it with other models, and decide whether it already performs at a state-of-the-art level or whether we need to improve it. This section gives an analytical description of all the performance metrics that we use in Studio.

Performance Metrics

The first thing a data scientist does to check a model's performance is to look at the confusion matrix! The confusion matrix shows us the model's predicted labels, e.g., for the sentiment, compared to the actual labels (extracted from the annotated data). Based on the confusion matrix, we can calculate the performance metrics below to understand our model's performance.

[Screenshot: confusion matrix]

Before we proceed with the performance metric definitions, let's clarify what True Positives, True Negatives, False Positives & False Negatives are, so that we understand the role of the None labels in our models. Be careful not to confuse any of these concepts with the sentiment labels.

Keep in mind that our models try to be “smart”: their strategy is to avoid making mistakes rather than randomly guessing an answer when they are not confident about a prediction. Hence, if it is not “clear” to a trained model which label it should predict, the model does not predict any label and instead assigns the None label to that piece of text. For example, if an ABSA model trained on finance data “sees” the following text:

“The weather is nice.”

It will return a None label, which means that it could not detect any aspects in the given text. This is called a “negative,” and when it is predicted correctly (there really is no aspect in the text), it is more precisely called a True Negative.

On the other hand, when the model was wrong and should have predicted some labels for the given text, but it predicted a Negative (None) instead, we have a False Negative.

So, the True & False Negatives are nothing more than the correct & false predictions of the None case (no label was assigned to the text/review), respectively.

The True and False Positives are easier to understand: they are the correct and incorrect predictions of the different labels, respectively. So, for an ABSA model, we define a Positive as a case where the model predicted at least one aspect and at least one aspect was assigned to the same text in the annotated data (actual). To decide whether this Positive prediction was True or False, we compare the predicted and actual aspects: we call the Positive a True Positive when the aspects are the same, and a False Positive when the predicted aspect is not the same as the actual aspect.

In a nutshell:

True Positive: correctly predicted labels
True Negative: correctly predicted None
False Positive: falsely predicted labels
False Negative: falsely predicted None
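To make these four cases concrete, here is a minimal sketch in Python (with purely hypothetical label lists; the real Studio evaluation code may count edge cases differently) of how they could be tallied when a prediction can be None:

    # Hypothetical predictions: one label (or None) per piece of text.
    predicted = ["battery", None, "price", None, "screen"]
    actual    = ["battery", None, "delivery", "price", None]

    tp = tn = fp = fn = 0
    for pred, true in zip(predicted, actual):
        if pred is None and true is None:
            tn += 1        # correctly predicted None
        elif pred is None:
            fn += 1        # predicted None although a label was expected
        elif true is None:
            fp += 1        # predicted a label although none was expected
        elif pred == true:
            tp += 1        # correctly predicted label
        else:
            fp += 1        # predicted the wrong label (simplification)

    print(tp, tn, fp, fn)  # -> 1 1 2 1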

Looking at the accuracy formula, we can see that the accuracy of a model is the ratio of the labels the model predicted correctly to all the labels the model predicted. Hence, for an Aspect Based Sentiment Analysis model, the accuracy score is the ratio of the correctly predicted aspect-sentiment pairs to the total predicted pairs. Note that predictions of the None label (no aspect was detected in a piece of text) are also included in this calculation.

Accuracy = (TP+TN)/(TP+FP+FN+TN)
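Continuing with the hypothetical counts from the sketch above, the accuracy could be computed like this:

    def accuracy(tp, tn, fp, fn):
        # Share of all predictions, Nones included, that were correct.
        return (tp + tn) / (tp + tn + fp + fn)

    print(accuracy(tp=1, tn=1, fp=2, fn=1))  # -> 0.4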

Sometimes we need to investigate how many None labels (Negatives) the model returned and how many of them were predicted correctly (True Negatives). Sometimes we want to see how many labels our model predicted correctly (True Positives). And sometimes we want to see whether there were no predictions at all for a particular case, e.g., the Neutral sentiment. These investigations help us understand our model's performance in depth and discover what we need to improve it. For this purpose, we look at the following two metrics.

Precision and Recall are two metrics that focus on how well our model learned to predict labels from our data. They do this by leaving the True Negatives (how many None labels were predicted correctly) out of their formulas.

Here are their formulas:

Precision = TP/(TP+FP)
Recall = TP/(TP+FN)

More specifically, precision is the ratio of the correctly predicted labels to the sum of the correctly predicted labels and the falsely predicted labels. Recall is the ratio of the correctly predicted labels to the sum of the correctly predicted labels and the cases falsely predicted as None.

In other words, precision is the number of correct results divided by all returned results, while recall is the number of correct results divided by the number of results that should have been returned.
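As a small illustrative sketch (again using the hypothetical counts from above, not the actual Studio implementation), Precision and Recall could be computed like this:

    def precision(tp, fp):
        # Correct labels out of all labels the model returned.
        return tp / (tp + fp) if (tp + fp) else 0.0

    def recall(tp, fn):
        # Correct labels out of all labels that should have been returned.
        return tp / (tp + fn) if (tp + fn) else 0.0

    print(precision(tp=1, fp=2))  # -> 0.333...
    print(recall(tp=1, fn=1))     # -> 0.5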

Lastly, let’s look at the F1 score, which is the harmonic mean of the Precision and Recall scores. You can see its formula below:

F1 = 2*(Precision*Recall)/(Precision + Recall)

F1 is a balance between the other two metrics, and we use it when we have many True Negatives (actual Nones), since these do not enter its calculation and would otherwise inflate the accuracy score.
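Using the Precision and Recall values from the sketch above, the F1 score could be computed like this:

    def f1_score(precision, recall):
        # Harmonic mean of precision and recall; True Negatives play no role here.
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    print(f1_score(precision=1/3, recall=0.5))  # -> 0.4 (up to float rounding)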

The above are the four standard performance metrics used to describe a model's performance. You can see all these metrics for your custom Studio models in the training section, in the Training Overview, after selecting the expert mode, as shown in the picture below.

[Screenshot: performance metrics in the Training Overview (expert mode)]

But we did not stop there! We wanted to compare your model's “IQ” with a human's ability to annotate correctly, since a model can only be as “smart” as its training dataset allows it to be. For this purpose, we compare your model's F1 score to the F1 scores of many of our high-performing industry models (you can find these models in our Model Library). This score is called the Overall Score or Standardized Score, and its color indicates how strong your model is. In the picture below you can see a scale that represents your model's performance, going from the lowest (red) to the highest (dark green).

[Screenshot: Overall Score color scale, from red (lowest) to dark green (highest)]

For a MultiClassLabel model such as an ABSA model, we calculate the model's overall score (ABSA score) as a combination of the F1 Standardized Aspect Score and the Sentiment Task Weighted Accuracy Standardized Score.

Based on research conducted by the University of Innsbruck, we know that humans put significantly more weight on the correctness of the sentiment label while they annotate data. For this reason, we give more weight to the contribution of the Sentiment Task Weighted Accuracy Standardized Score when calculating the ABSA score.
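As a purely illustrative sketch, the combination could look something like the following; the actual weights and standardization used in Studio are internal, so the 0.6 below is only a placeholder:

    def absa_overall_score(aspect_f1_std, sentiment_acc_std, sentiment_weight=0.6):
        # Hypothetical weighted combination of the two standardized scores;
        # the sentiment part gets the larger weight, but the exact value is not public.
        return (1 - sentiment_weight) * aspect_f1_std + sentiment_weight * sentiment_acc_std

    print(absa_overall_score(aspect_f1_std=0.7, sentiment_acc_std=0.8))  # -> 0.76 with the placeholder weight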

Lastly, to calculate the Sentiment Task Weighted Accuracy Score, we compute the accuracy (as discussed above) only on the examples where the aspect was already detected correctly. Then we compare this score with the corresponding scores of our high-performing industry models to standardize it.
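The standardization step depends on our internal industry models, but the first part (scoring the sentiment only on examples whose aspect was already detected correctly) could be sketched roughly like this, using entirely hypothetical data:

    # Hypothetical (aspect, sentiment) pairs: model prediction vs. annotation.
    examples = [
        {"pred": ("battery", "negative"), "true": ("battery", "negative")},
        {"pred": ("price",   "positive"), "true": ("price",   "negative")},
        {"pred": ("screen",  "neutral"),  "true": ("camera",  "neutral")},   # wrong aspect, skipped
    ]

    # Keep only the examples where the aspect was detected correctly ...
    aspect_correct = [ex for ex in examples if ex["pred"][0] == ex["true"][0]]

    # ... and compute the plain accuracy of the sentiment label on that subset.
    hits = sum(ex["pred"][1] == ex["true"][1] for ex in aspect_correct)
    print(hits / len(aspect_correct))  # -> 0.5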