Part: High-Quality Training Data Labelling for Opinion Mining Models
This tutorial presents tips on data labelling that help your model reach high accuracy. We start with the purpose of data labelling, cover 7 principles to follow, and introduce our "Active Learning" feature, which can help you automate part of this process.
The Purpose of Data Labelling
Humans learn best by example. AI models learn the same way!
Think back to your own learning experience: you grasp a concept by seeing examples of it in its original context, and then it is easy to make sense of it. The same applies to AI model training. You show the AI model examples of where a topic and a sentiment appear in text (their original context), so that the model can pick up those topics and sentiments when they appear in similar sentences. The more examples it sees, the better the AI learns and the better it generalizes what it has learned.
The job of the data annotator is therefore to help the AI model associate sentences with concepts (topics and sentiments), as shown in the image below.
Labelling Data on DeepOpinion Studio
DeepOpinion Studio offers a user-friendly interface that helps you label data efficiently and interactively. Compared to other tools, DeepOpinion's approach to creating a new model is 10X more efficient. When creating a new model, data labelling takes place in Step 3, "Training Data". Here are the data labelling steps:
- Navigate to Step 3 "Training Data"
- Click the "NEW LABELLING SESSION +" button at the top right
- In the pop-up window, set the session type to "Annotation" and click the "CREATE LABELLING SESSION" button
- A new session is added to the page; click on it to start labelling training data
- The text box at the top right contains the text you have to label, as shown in the image below
- Read the highlighted sentence and select the topic and sentiment from the menu below the text
- Once you finish labelling the session, you will be notified and taken to the home page
Tip 1: You can label the same sentence with multiple topics and different sentiments if they appear in the text.
Tip 2: When starting, aim to label 10 sessions as a minimum and then train your first model.
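To make Tip 1 concrete, here is a minimal Python sketch of how one sentence carrying several opinions could be represented. The `Opinion` and `LabelledSentence` classes, the topic names, the sentiment values, and the example review are all hypothetical illustrations; they are not DeepOpinion Studio's actual data format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Opinion:
    """One (topic, sentiment) pair attached to a sentence."""
    topic: str
    sentiment: str  # assumed label set, e.g. "positive" / "negative" / "neutral"


@dataclass
class LabelledSentence:
    """A sentence together with every opinion the annotator found in it."""
    text: str
    opinions: List[Opinion] = field(default_factory=list)

    def add(self, topic: str, sentiment: str) -> None:
        """Attach an opinion, ignoring exact duplicates."""
        opinion = Opinion(topic, sentiment)
        if opinion not in self.opinions:
            self.opinions.append(opinion)


# Hypothetical review sentence with two topics and two different sentiments.
example = LabelledSentence("The room was spotless, but the staff were rude.")
example.add("Cleanliness", "positive")
example.add("Staff", "negative")

for opinion in example.opinions:
    print(f"{opinion.topic}: {opinion.sentiment}")
```

Storing a list of opinions per sentence, rather than a single label, is what allows the same sentence to carry multiple topics with different sentiments.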
The 7 Principles for High-Quality Training Data
- Involve domain experts in the process
- Define topics to be comprehensive and non-overlapping in coverage
- Write definitions for each topic and add examples
- Annotate based on explicit mentions of topics with sentiments (= opinion)
- Ignore implicit and subjective topics or factual statements
- Focus on producing high quality labels, take breaks to refresh focus
- Iterate between annotating and training to speed up the process
1. Involve domain experts
Involving domain experts ensures that their expertise is reflected in the training data, which gives the AI model a good chance of mimicking the domain expert. Who the domain expert is depends on the specific use case. For example, for a customer review analysis model it would be the Customer Experience Manager, and for employee feedback it would be the HR Manager.
2. Define topics to be comprehensive and non-overlapping in coverage
Topics should be designed to reflect what is actually expressed in the text data. Doing this step well means that your list of topics is mutually exclusive in meaning and collectively exhaustive in covering all the topics that appear in the text. Therefore, avoid topics that are similar and can overlap in meaning. This illustration shows how you can cover the content of the text with topics.
3. Write definitions for each topic and add examples
It is good practice to write a definition for each topic and add examples, so that you understand how the topic appears in your text and can label consistently. When you are labelling data and are unsure whether a topic is relevant, you can refer to its definition. It also helps other people label the data consistently.
4. Annotate based on explicit mentions of topics with sentiments (= opinion)
Make sure that when labelling you only select a topic that is explicitly mentioned in the text. This is because AI models can only process and learn from information that is really present in the text.
5. Ignore implicit and subjective topics or factual statements
The AI model can only learn from the text it sees and has no access to external knowledge. Labelling topics that do not appear in the text therefore confuses the model. For example, if a customer review says "why didn't I see this product before", you shouldn't label "Marketing" with "Negative" sentiment, because marketing is not referred to in the text.
6. Focus on producing high quality labels, take breaks to refresh focus
This point is self-explanatory: taking breaks helps you stay focused and produce consistent labels that the AI model can mimic.
7. Iterate between annotating and training to speed up the process
Iterate on this process by annotating ~100 examples at a time until you reach the desired performance. As a rule of thumb, aim to label at least 30 examples for each topic so that the AI model sees enough variety in how the topic appears; this helps it reach high accuracy. The data labelling and model training workflow is optimized to help you run through this process efficiently, and you can expect to train a model from scratch in a matter of a few hours. This is what the process looks like.
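The rule of thumb of at least 30 examples per topic is easy to check between labelling rounds. The following Python sketch assumes your labels are available as a list of (topic, sentiment) pairs; the function name `topics_needing_more_labels`, the topic names, and the sample data are hypothetical and only serve as an illustration.

```python
from collections import Counter

MIN_EXAMPLES_PER_TOPIC = 30  # rule of thumb from this tutorial


def topics_needing_more_labels(labels, all_topics, minimum=MIN_EXAMPLES_PER_TOPIC):
    """Return {topic: number of labels still missing} for under-labelled topics.

    `labels` is an iterable of (topic, sentiment) pairs, one per labelled opinion.
    `all_topics` lists every topic of the model, so that topics with zero
    labels are reported as well.
    """
    counts = Counter(topic for topic, _sentiment in labels)
    return {
        topic: minimum - counts[topic]
        for topic in all_topics
        if counts[topic] < minimum
    }


# Hypothetical labelled data: 35 "Price" labels, 12 "Staff" labels, no "Location" labels.
labels = [("Price", "negative")] * 35 + [("Staff", "positive")] * 12
missing = topics_needing_more_labels(labels, ["Price", "Staff", "Location"])
print(missing)  # {'Staff': 18, 'Location': 30}
```

Running a check like this after each round of ~100 annotations shows which topics to focus on in the next round.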
Active learning is here to automate data labelling
Our goal is to minimize the manual effort of creating a model. Therefore, we released a new feature that helps automate the data labelling step. Once you have labelled some data and trained a model with at least 50%, you can switch to "Active Learning". This is how it works:
- The model goes through all the unlabelled text and finds the top examples that it is least confident about
- It predicts the labels for topic and sentiment
- You then confirm the predicted labels by ticking the checkbox, or correct the labels if the model got them wrong
This way you provide the model with the most impactful examples to improve its accuracy. It also reduces your labelling effort by 5X compared to conventional approaches.
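The selection step can be illustrated with least-confidence sampling, a common uncertainty-sampling strategy. This is a sketch under assumptions: the tutorial does not say which strategy DeepOpinion Studio uses internally, and the function `least_confident`, the label names, and the probabilities below are made up for illustration.

```python
def least_confident(predictions, k):
    """Return the k texts whose top predicted label has the lowest probability.

    `predictions` maps each unlabelled text to a dict of {label: probability}.
    Each result is a (text, predicted_label) pair for the annotator to
    confirm or correct.
    """
    def top_confidence(item):
        _text, probabilities = item
        return max(probabilities.values())

    # Lowest top-label probability first, i.e. least confident first.
    ranked = sorted(predictions.items(), key=top_confidence)
    return [(text, max(probs, key=probs.get)) for text, probs in ranked[:k]]


# Hypothetical model output for three unlabelled sentences.
predictions = {
    "Great value for money.": {"Price/positive": 0.95, "Price/negative": 0.05},
    "The staff could have been quicker.": {"Staff/negative": 0.55, "Staff/positive": 0.45},
    "It arrived on Tuesday.": {"Delivery/positive": 0.40, "Delivery/negative": 0.35, "No opinion": 0.25},
}

for text, label in least_confident(predictions, k=2):
    print(f"{label!r} predicted for: {text}")
```

In this example the confidently predicted sentence about price is skipped, and the two uncertain sentences are queued for review, which is where a human label teaches the model the most.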
As always, please remember to share your feedback and any questions you have.