How can I better understand my custom model?

Model Explainability

AI models are powerful tools: they save us time on manual work and often label data more efficiently than a human, as they can process large amounts of data automatically. On the other hand, model explainability is one of the biggest challenges in data science, but do not worry, we have a plan!

We present below several steps to follow when you want to investigate a model’s quality. These are the exact steps that our data science team would follow to examine your model.

  • Step 1 - Understand your goal
  • Step 2 - Understand the training dataset
  • Step 3 - Understand one aspect’s distribution
  • Step 4 - Understand the distribution of all the aspects
  • Step 5 - Model training & sample size
  • Step 6 - When it feels wrong
  • Step 7 - Evaluation scenarios & none labels

Step 1 - Understand your goal

Today, decision-making in business is based on data. The challenge is to find good-quality data to make the best decisions. Until recently, we could only process structured data, i.e., numbers, at scale, but what about all the hidden gems in the unstructured data?

Imagine that you have a hotel and you want to know more about your guests’ experience. How could you ever achieve that efficiently without taking their emails, reviews, and voice into consideration? How can you understand whether they feel positive or negative about your business? On top of that, wouldn’t you like to know which aspects of your business are doing well and which ones you should improve?

I remember a hotel manager who could see low ratings of her hotel on TripAdvisor, but she could not pinpoint the cause of those low ratings. She could read some of the thousands of reviews online, yet she still knew nothing about the overall picture of the hotel because she could not transform the unstructured data (customer reviews) into structured data. If she managed to achieve that, she would be able to efficiently monitor the hotel’s performance on different aspects such as food service, food quality, the hotel’s location, room cleanliness, etc. Imagine if she had a model into which she could feed all of the written reviews and receive as an outcome the total number of positive or negative reviews per aspect. This would help her gain a high-level view of her business and identify the problematic aspects. An AI model that deals with text could help her prioritize the hotel’s needs and act more efficiently as a hotel manager.

We assume that you already have a few thousand unstructured data examples, e.g., product reviews, consumer complaints, emails, or any other text you would like to extract insights from. As you already know, extracting those insights is a challenging and time-consuming process, because tasks like manual labeling require a lot of resources.

So, you decided to build a model that will automatically transform this data into structured (labeled) data that you can measure and use to make data-driven decisions, right?

In the process of building a custom model, you had to collect your data, label a few examples, and then use DeepOpinion’s Text Intelligence Platform to train your custom model. Our goal here is to help you understand the best model for YOUR business’s current needs, so you can automate your processes, add new features to them, and better approach your users.

The power of these models is that they are tireless, fast, accurate, and easy to use or update whenever you want. Their only weakness is that they cannot guess how you would like them to label your data, so we highly suggest having a clear idea of what you would like to achieve and how you should teach your model.

Step 2 - Understand the training dataset

Let’s continue with something that you are already familiar with, or at least got a taste of during annotation: your labeled data or, in other words, what the model knows as training data.

During manual labeling, and probably without even noticing it, you also got a better understanding of your data variation. What is data variation, and how does it affect your model’s performance?

I would like you to keep in mind one specific scenario, feature, aspect, label, or whatever you want to predict with your model until the end of this section. Do not focus on how many of them you want your model to recognize in total, e.g., 23 unique aspects, as we will talk about this type of variety later. Let’s focus for now on one specific case.

Imagine that hotel visitors wrote the following reviews.

Example - Hotel reviews

  1. The service in this restaurant is terrible.
  2. The waiter treated me like I am a dog.
  3. I never go back as the way they offer food to people is hideous.
  4. The service was fantastic.
  5. Service here felt like Christmas.
  6. Do you expect to have customers with your current service?
(Screenshot: the model’s predicted sentiment labels for the reviews above)

This hotel faces a challenge because its visitors come from different places around the world and have diverse educational backgrounds. This leads to reviews expressed with different vocabulary, different wording, and different tones.

There are many ways to say that something is excellent or terrible, as we do not all share the same vocabulary or manner of expressing things. Especially when humor, sarcasm, or proverbs are used, it is sometimes difficult even for humans to understand a text (this is why we prefer to call someone to clarify things). Above, you can see how our AI model, which was trained on hotel data, was able to correctly label the sentiment in all cases except the last one, where it produced no label (we will come back to this later).

We talk about homogeneity when there are only a few ways of saying something and we do not have many variations of the same thing. While working with language models, it is crucial to understand that if a specific concept appears under many different synonyms in our training data, the sample is not homogeneous for this concept. The sample’s inhomogeneity makes it more difficult for the model to learn this concept in the first place. After adding more labeled data (more synonyms), we would have a model better able to recognize all the different terms used for a specific concept.

Another example is to imagine what would happen to our data if we translated it into another language. Imagine that you are a tourist and want to write a review about the hotel you just visited. Your mother tongue is Chinese, and you do not speak English. To make sure that the hotel manager will understand your critique and take action on the matter that left you disappointed, you use Google Translate to translate your review into English and then post it online.

Passing text through a translator can simplify the language. Hence, training a model on native data (a heterogeneous dataset with a flatter data distribution) would lead to a model that generalizes with a smaller error than a model trained on translated data (a homogeneous dataset with a steeper bell curve). The latter option might be faster if we do not have native data but do have similar data in another language, but it leads to a model of a more limited version of the actual data.

Step 3 - Understand one aspect’s distribution

The wisdom of the crowd is the collective opinion of a group of individuals rather than that of a single expert!

In our hotel example, the above concept can be translated as: the majority of the words that customers use in their reviews to describe an aspect, e.g., cleanliness, will also be commonly used by future reviewers to describe the same aspect.

In other words, if some words/phrases are frequently used to describe cleanliness in general, we expect to see these words appear more frequently in our data sample. On the other hand, if a specific word/phrase is not present (or is present only at a minimal frequency) in the cleanliness examples, then it is probably true that this word/phrase is not used in general for describing the aspect of cleanliness.

The range and the frequency of the different words used to describe one aspect make up that aspect’s distribution.
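
To make the idea of an aspect’s distribution concrete, here is a minimal Python sketch over a tiny hypothetical set of cleanliness sentences (not real customer data): counting word frequencies per aspect is one simple way to see how wide or narrow that aspect’s vocabulary is.

```python
# A minimal sketch (hypothetical mini-corpus, not the platform's internals) of
# inspecting one aspect's distribution: the range and frequency of the words
# used to describe cleanliness.
from collections import Counter

cleanliness_reviews = [
    "the room was spotless",
    "the bathroom was dirty",
    "dirty sheets and a dusty floor",
    "spotless and tidy room",
]

# Count how often each word appears across the cleanliness examples.
word_counts = Counter(word for review in cleanliness_reviews for word in review.split())
print(word_counts.most_common())  # the wider and flatter this list, the more varied the aspect
```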

Comparing the two examples below, we can see a wider variety of words used for cleanliness in the second example. In both cases, the words beautiful and costly say nothing about the aspect of cleanliness. Since we did not teach our model through annotation that these two words talk about cleanliness, in an example containing only these (or other similar) words the aspect of cleanliness would not be detected.

Example 1: Variety & frequency of words that describe Cleanliness in a Hotel Model

Example 2: Variety & frequency of words that describe Cleanliness in a Hotel Model

The number of different scenarios and their frequency in a dataset describe the dataset’s variety or, in other words, its distribution. The normal distribution is depicted as a bell curve, which is steeper (smaller standard deviation) when the examples are not spread apart (less diversity in the ways that someone can describe a specific scenario, aspect, etc.). To sum up, the bigger the variety we want to capture, the larger the sample size should be. This also depends on how rich a language is.

From our experience in opinion mining, having approximately 40 examples for a particular scenario, aspect, case, or label (depending on your model) is usually enough to train a finetuned language model. For a language model that is not finetuned, we would need more: approximately 70 examples.

Step 4 - Understand the distribution of all the aspects

In this section, we will not focus on a specific scenario, class, aspect, label, etc., as we did before, but we will keep an eye on the high-level view of our use case. For example, in the hotel’s case, we would like to know how many different aspects of the hotel we should examine. What are the visitors talking about? Knowing which aspects are essential, the hotel manager can decide which aspects should be included in her AI model.

Now we understand why it is essential to know how many classes, and which ones, we want our model to be able to predict. This number will affect the size of the training dataset.

We have an English model finetuned on the hotel industry (which means that it understands English words in the hotel domain), and we want to train it on the ABSA task (detect aspects & sentiments). If we want the model to know 6 aspects and 3 sentiment labels, we will need approximately 6 × 3 × 40 = 720 annotations for training, based on what we have said so far.
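
As a quick illustration of the arithmetic above, here is a small back-of-the-envelope Python sketch; the function name is hypothetical, and the 40/70 figures are the rules of thumb from Step 3, not a platform feature.

```python
# Estimate how many annotations an ABSA training set needs, assuming
# ~40 examples per (aspect, sentiment) pair for a finetuned model and
# ~70 per pair for a model that is not finetuned (rules of thumb from Step 3).

def required_annotations(n_aspects: int, n_sentiments: int, finetuned: bool = True) -> int:
    examples_per_pair = 40 if finetuned else 70
    return n_aspects * n_sentiments * examples_per_pair

print(required_annotations(6, 3, finetuned=True))   # 6 * 3 * 40 = 720
print(required_annotations(6, 3, finetuned=False))  # 6 * 3 * 70 = 1260
```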

How do we decide on the aspects when we have unlabeled data? An industry expert is often aware of the different classes or aspects of a product. A product owner usually knows exactly which aspects they are interested in (if not, some research should be done before labeling data). Another way to come up with a model’s aspects is to look at the data and try to label a few examples manually based on their context. After a few examples, you will see that the labels start repeating. If you continue for a while and no new labels come up once you have annotated more than 40 examples per label, then there are probably no more new scenarios.

How do we decide on the aspects when we have labeled data? If we have a lot of labeled data, then we can just take a sample of it. If we choose random sampling as a sampling method, we should take our data distribution into consideration. In the case of an imbalanced dataset, we should balance it (using an undersampling or an oversampling method) before sampling. This increases the chances that the rarely occurring classes are present in our sample. Have a look below to understand what is considered an imbalanced dataset.

Example 3: Imbalanced dataset

The dataset depicted in the above bar plot is imbalanced because the class functionality has significantly fewer examples than the other three classes. If we sample from it before balancing it, then it is highly probable that we will not “catch” many functionality examples in our sample.
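
Below is a minimal sketch, assuming a pandas DataFrame called reviews_df with a label column (both hypothetical), of naive random oversampling before drawing a sample, so that rare classes such as functionality are not missed. It is only one simple balancing option, not the only one.

```python
import pandas as pd

def oversample(df: pd.DataFrame, label_col: str = "label", seed: int = 42) -> pd.DataFrame:
    """Randomly duplicate examples of the rarer classes until all classes are equally sized."""
    max_size = df[label_col].value_counts().max()
    balanced = [
        group.sample(n=max_size, replace=True, random_state=seed)  # upsample each class
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(balanced).sample(frac=1, random_state=seed)  # shuffle the result

# Hypothetical usage:
# balanced_df = oversample(reviews_df)
# sample = balanced_df.sample(n=500, random_state=42)  # rare classes can now appear in the sample
```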

Step 5 - Model training & sample size

After annotating data, we learned more about the samples’ distribution and their homogeneity. Now, we understand part of the results and the performance metrics, but these are still not enough to explain:

  1. Why, even though I provided some labeled data for a particular aspect, does the model still not know this aspect and return a None label?
  2. How many annotations are considered enough for teaching something to the model?
  3. Why does it seem like my model can’t learn from the training data?

After answering these last questions, you will have a 360-degree understanding of your training data and how your model learns from it!

An essential thing to keep in mind when addressing all three questions above is that the annotated data is split into training and testing data, i.e., data for teaching the model and data for testing whether the model learned what it was supposed to know. The annotated data is randomly split into training and testing data. We usually keep the majority for training, e.g., 80% of all annotated data, and less for testing, e.g., 20%.
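
Here is a minimal sketch of that random 80/20 split, using scikit-learn’s train_test_split on a few made-up hotel reviews; the platform performs its own split internally, so this is only to illustrate the idea.

```python
from sklearn.model_selection import train_test_split

texts = [
    "The service was fantastic.",
    "The waiter treated me like I am a dog.",
    "Service here felt like Christmas.",
    "The service in this restaurant is terrible.",
    "I never go back as the way they offer food to people is hideous.",
]
labels = [("Service", "Pos"), ("Service", "Neg"), ("Service", "Pos"),
          ("Service", "Neg"), ("Service", "Neg")]

# 80% of the annotated examples go to training, 20% to testing, chosen at random.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
print(len(train_texts), len(test_texts))  # 4 1
```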

Now imagine that only one visitor wrote a review regarding the food in the hotel, and the visitor was happy about it, so the review was positive. He wrote the following:

“The food was tasty, cheap & healthy in the Hotel’s restaurant.”

If we add this example to our annotations with the labels (“Food”, ”Pos”) and retrain our model, then one of the following cases will occur:

  1. After the random split, this sentence falls into the training set; hence the model learns about this case, but since this sentence is not present in the testing set, we will not see any increase in the model’s performance.
  2. After the random split, this sentence falls into the testing set; hence the model does not learn about this sentence. When we later test it on this sentence, it will probably return a None label (no prediction for the aspect or the sentiment) as it has never seen this case before (the case was not present in the training set).

You can now better understand why the model doesn’t necessarily learn a case even when you provide an example of it. To enhance learning, we provide more than one example per case. This gives us better chances that the case is included in the training set after splitting, so our model will be able to learn from it.

Step 6 - When it feels wrong

We would like to explain here one tool that can help us investigate a “strange” situation, convince ourselves that our model is robust, and trust it. Sometimes what feels normal and logical is not the same as what our model predicts, but this does not mean that the model is wrong. It only means that either we did not provide enough information for the model to “see” things the way we see them, or that we tend to generalize and focus on a specific tree rather than seeing the whole forest. Our model keeps a holistic, closer-to-reality view by using statistics. To analyze these two points further, I will refer to two different business cases.

A few months back, a customer came to us sharing their concern that their model was not “smart” enough, as it kept predicting some reviews as negative even though they had a 5-star rating. In our customer’s mind, the assumption was that a 5-star review must be positive, while the model, which did not even know whether 5 stars was a high score or not (how could it, since no one explained it to the model), kept doing what it knew: judging a review by its content. It turned out that 5-star reviews can include negative comments (in addition to positive ones); hence the model was working correctly. Additionally, we learned that even if the metadata paints a specific picture, like a 5-star rating, the model only learns from the text input (e.g., reviews) included in the training data and might give a different output than the metadata suggests.

Another customer felt that even though their two different models had the same recall score (a performance metric related to the number of None labels), the batch analysis results from model A had many more None labels than those from model B. The interesting word to focus on here is “felt”. We trust the statistics, e.g., the ratio of the total number of None labels to the total number of all predicted labels, because as human beings we tend to latch onto a feeling and miss the objective truth.
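
Here is a tiny sketch of the statistic we trust in this situation, the None-label ratio, computed over two hypothetical lists of predictions (not actual platform output).

```python
def none_ratio(predictions: list) -> float:
    """Share of examples for which the model returned no label."""
    return sum(p is None for p in predictions) / len(predictions)

model_a_predictions = ["Service", None, "Food", None, "Room"]    # hypothetical output
model_b_predictions = ["Service", "Food", "Room", "Food", None]  # hypothetical output
print(none_ratio(model_a_predictions))  # 0.4
print(none_ratio(model_b_predictions))  # 0.2
```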

Let’s see what happened in the above case. The two models had the same recall score, but this says nothing about the training data of the two models (if they were different, then the models knew different things) or the datasets that were used for batch analysis. Now, let’s investigate all the different cases here based on what we have learned so far:

i) the customer compared the performance of model A & B on different batch data (size and content)

  1. If the dataset that was batch analyzed by model B was more extensive, e.g., 10 times bigger, then it is expected to have more None labels in absolute numbers, roughly 10 times more, which could create a feeling of a higher recall.
  2. If models A & B were trained on different training sets, they knew different things. If the batch dataset analyzed by model A included many cases “unknown” to it, then model A would return many None labels. If the batch dataset analyzed by model B included many cases “known” to model B, then model B would not return many None labels. In this case, the customer’s feeling was correct, and it would mean that model A had to make predictions on a different data distribution than the one it was trained on, while model B had better performance (a smaller recall score) because its training set and batch set had similar distributions.

ii) the customer compared the performance of model A & B on the same batch data (size and content)

  1. If models A & B were not trained on the same training set, it is natural to expect them to know different things and react differently, even on the same dataset.
  2. If models A & B were trained on the same training set, we should look at the order of the batch examples, e.g., all the None labels might appear in the first rows of model A’s results. This could lead us to feel that there are more None labels for model A, even though both models would have the same recall score on this batch dataset.

Step 7 - Evaluation scenarios & none labels

In the last case, we got a taste of what can happen if we train our model on one domain and then use it on a different one. To provide more insight on this issue, I will introduce the concept of transfer learning, which is the use of knowledge gained from one task when learning another, through a simple example.

Let’s assume that we have 5,000 labeled reviews for restaurants and 5,000 labeled reviews for laptops, and we want to build a model. We understand that if we use the labeled data coming from the restaurant reviews, we will construct a model specialized, or in other words finetuned, on restaurant reviews. If we used only laptop reviews, then we would build a model specialized in reading laptop reviews. But what would happen if we used all of our data, and how well would a laptop model react to restaurant reviews and vice versa?

When someone trains a model on a domain-specific dataset, e.g., restaurant reviews, and then uses it to do batch analysis (predict labels) on the same domain, we call this process In-Domain Training.

Now assume that you train a model on a domain-specific dataset, e.g., restaurant reviews, and evaluate it (check its performance) on a dataset from a different domain, e.g., laptop reviews. This process is called Cross-Domain Training.

In the first case, the model’s performance is expected to be higher, as the model is evaluated on the same domain it has been trained on. If someone wants to create a model that performs well across different fields, then Cross-Domain Training would make more sense.

Let’s now see what happens if we use datasets from different domains for training. Training a model on a downstream task, such as the ABSA task, on a training dataset created by joining datasets from different domains, e.g., laptop and restaurant reviews, and then evaluating it on the individual datasets independently, is called Joint-Domain Training.
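
The toy sketch below contrasts the three setups using a tiny bag-of-words classifier from scikit-learn and made-up restaurant and laptop sentences. It is only meant to show how training and evaluation data are combined in each setup, not how the platform’s transformer models are trained.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up datasets for two domains.
restaurant_texts = ["The food was tasty", "The waiter was rude", "Great service", "Terrible meal"]
restaurant_labels = ["pos", "neg", "pos", "neg"]
laptop_texts = ["The battery lasts long", "The screen broke fast", "Fast processor", "Awful keyboard"]
laptop_labels = ["pos", "neg", "pos", "neg"]

def fit(texts, labels):
    return make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, labels)

# In-Domain: train and evaluate on the restaurant domain
# (in practice you would evaluate on held-out data from the same domain).
in_domain = fit(restaurant_texts, restaurant_labels).score(restaurant_texts, restaurant_labels)

# Cross-Domain: train on restaurants, evaluate on laptops.
cross_domain = fit(restaurant_texts, restaurant_labels).score(laptop_texts, laptop_labels)

# Joint-Domain: train on both domains, evaluate on each domain independently.
joint_model = fit(restaurant_texts + laptop_texts, restaurant_labels + laptop_labels)
joint_on_laptops = joint_model.score(laptop_texts, laptop_labels)

print(in_domain, cross_domain, joint_on_laptops)
```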

In the last case, someone might expect a Joint-Domain trained model to be “wiser” than an In-Domain trained model because it has “seen” more. This is not always true, as it depends on how close the different domains are. For example, if we train a model on data coming from a mobile banking app and a food mobile app and then evaluate it on data coming from the first domain (the mobile banking app), we would expect better performance than if we had trained the model on mobile banking app data and hotel data.

We use a pre-trained language model as a base because it has been trained on knowledge-based corpora like Wikipedia, which means that the model already knows things about language syntax and grammar and understands the context of some words based on how they are used in these corpora. When we finetune a model on an opinion corpus, e.g., reviews, coming from a specific domain, we show the model how to interpret words based on their special meanings within that particular discipline.

For example, the word cheap can have a positive or a negative meaning.

  • “The hotel manager’s behavior was very cheap.” → Aspect: Service, Sentiment: Negative
  • “The app is cheap in comparison to its performance.” → Aspect: Payment, Sentiment: Positive

The word cheap has a negative meaning in the Hotel domain, while in the Banking mobile app domain it can have a positive meaning.

To sum up, sometimes less is more! If we want our model to perform well on a specific domain, we should train it and use it on data from that domain (In-Domain Training). If we use data sources from fields that are not similar and some of their words have opposite meanings, then we will rather “confuse” our model than improve it. To understand why, imagine that if the model had seen examples only from the first case, it would be confident in predicting the Sentiment label as Positive. On the other hand, if the model has seen examples from both domains (hotel, banking), it knows that “cheap” can be interpreted positively or negatively. Then, to avoid making a wrong prediction, it will take a step back and will not predict any label. Having many None labels (no predicted labels) leads to a high recall metric.
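
To make the “step back” behavior concrete, here is a hypothetical sketch of abstaining when the model is not confident enough; the function, the probabilities, and the threshold are made up, and the real model’s decision logic may differ.

```python
def predict_with_abstain(probabilities: dict, threshold: float = 0.7):
    """Return the most likely label, or None when the model is not confident enough."""
    label, prob = max(probabilities.items(), key=lambda item: item[1])
    return label if prob >= threshold else None

# "cheap" seen with opposite meanings across domains -> low confidence -> None label
print(predict_with_abstain({"Pos": 0.55, "Neg": 0.45}))  # None
# "cheap" seen only in one domain -> confident prediction
print(predict_with_abstain({"Pos": 0.92, "Neg": 0.08}))  # Pos
```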

I will leave you with this: in data science, we make decisions based on our data. Data collection is not a big issue today. On the contrary, the volume and velocity of Big Data force us to use powerful AI models to process information faster and better, combining brute computational force and smart algorithms to achieve data transformation and knowledge extraction. Having this heavy task automated by cutting-edge software is impressive. It saves us a lot of time, time that we can use to understand these tools and then use them to process information at scale. Information leads to knowledge, and we can use it to respond better and faster to changes, understand our business, and make better decisions.