VLDB21 Panel Discussion
- Large-Scale Data Excellence
- Trust and Ethics
- Crowd-AI Interplay
A unique group of academic researchers from different fields, industry experts, and crowd performers from Toloka recently came together at the VLDB21 Crowd Science Workshop to discuss the factors that affect data labeling. We prepared a full transcript of the discussion to save you the time of watching the recordings.
moderator: Olga Megorskaya, Head of Toloka
Jie Yang
Assistant Professor at Delft University of Technology
Mohamed Amgad
Visiting Predoctoral Fellow, Pathology, Northwestern University
Zack Lipton
BP Junior Chair Assistant Professor of Operations Research and Machine Learning, Carnegie Mellon University
Grace Abuhamad
Applied Research Scientist, Trustworthy AI, ServiceNow
Ujwal Gadiraju
Assistant Professor, Web Information System group, Delft University of Technology
Ivan Stelmakh
PhD student, Carnegie Mellon University
Novi Listyaningrum
Student at Institut Kesenian Jakarta, Toloka Performer
Konstantin Kashkarov
Freelancer, Toloka Performer
AI is already shaping our lives today. In the future, the recommendations
and decisions made by AI will form part of life's daily fabric. But even AI
systems that are capable of making decisions have to be trained on
data, and this is often obtained by using data labeling. This means that
even small, micro-level changes made in data labeling could have a
direct impact on our future and our lives.
How can we control that?
1. The quality-scale tradeoff is a very common dilemma in machine learning production.
Grace Abuhamad offered her thoughts first: "For the business side, the quality of the product plays a big role, and there is always a data quality issue to manage with client databases when developing AI capabilities. On the other hand, many people rely on advances in low-data learning technologies to reduce the importance of the scale question."
At the same time, if we're talking about the medical domain, it's a different picture.
Mohamed Amgad had this to say about computational pathology: "We developed a system to automatically detect disease under the microscope and predict its consequences for the patient. One of the problems with many computational AI systems is overreliance on low-data schemes; in other words, generating a lot of data from a small number of unique patients. Heavy augmentation, GANs, and other existing methods can never compensate for the variability of data from unique patients or reproduce realistic, real-world variation. In the medical field, what is missing is quantity. Quality is highly important too, of course, but quantity remains very limited. If we do not have enough unique patients, the model will not generalize and will give very poor predictions for other patients."

In fact, it's not only the medical domain that suffers from a lack of data; so does every other domain that requires generalization and a broad view of its dynamics.
"Depending on the quality of the data, it is possible to define what scale can be accomplished," says Zack Lipton. There is a meaningful subcategorization here: some quality issues can be overcome with scale, but there are also cases where this is not possible. It is also interesting to compare two situations: a small amount of high-quality data versus a large amount of low-quality data. "Suppose we label every example many times and aggregate the labels with a majority vote or a more sophisticated crowdsourcing algorithm, ending up with far more labels than data points; alternatively, suppose we label a million examples once each. In some of our experiments, we observed that it is often better to collect a large amount of data where each example is labeled a single time," concluded Zack.
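To make the budget comparison concrete, here is a minimal simulation of the two strategies Zack contrasts. The accuracy, budget, and redundancy numbers are arbitrary assumptions for illustration, not figures from his experiments; the point is only the mechanics of trading label redundancy for dataset size under a fixed budget.

```python
import random
from collections import Counter

def noisy_label(true_label, accuracy, classes=("pos", "neg")):
    """A simulated annotator that is correct with probability `accuracy`."""
    if random.random() < accuracy:
        return true_label
    return random.choice([c for c in classes if c != true_label])

def majority_vote(votes):
    """Aggregate redundant labels for one example by majority vote."""
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
budget = 9000      # total number of labels we can afford (assumed)
redundancy = 3     # labels per example under the voting strategy (assumed)
accuracy = 0.8     # assumed per-annotator accuracy

# Strategy A: fewer examples, each labeled `redundancy` times and aggregated.
n_voted = budget // redundancy
voted = [majority_vote([noisy_label("pos", accuracy) for _ in range(redundancy)])
         for _ in range(n_voted)]

# Strategy B: spend the same budget labeling each example only once.
single = [noisy_label("pos", accuracy) for _ in range(budget)]

print(f"vote strategy:   {n_voted} examples, "
      f"{voted.count('pos') / n_voted:.1%} labels correct")
print(f"single strategy: {budget} examples, "
      f"{single.count('pos') / budget:.1%} labels correct")
```

With these assumed numbers, voting yields cleaner labels (roughly p^3 + 3p^2(1-p) ≈ 0.90 correct versus 0.80) but only a third as many examples, which is exactly the tradeoff the panelists debate: whether the extra examples help the model more than the extra label quality does.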
Jie Yang suggested that instead of debating the relative importance of quality and scale, it might be better to concentrate on the points that require focus right now. "I've noticed that the community now pays a lot of attention to the quantity of data, and not much to quality issues. By this point we have seen a lot of unfortunate events caused by failures of ML systems related to data quality. I don't believe these problems can be compensated for by having more data, but not everyone in the community realizes that," said Jie.

Recently, the HCOMP community and Facebook asked people to collect data instances that are challenging for machine learning. Still, precisely what characterizes these challenging instances for machines remains an open question. Can we formulate it right now? Can we say whether the human understanding of difficulty is aligned with that of machines? These questions are worth considering here and now.
Ivan Stelmakh agreed with this: "Large language models can learn from everything, but it is hard to guarantee the reliability of every piece of information. Maybe these models could serve as a method to study this tradeoff and to see what they learn from, to evaluate the quality and see how it impacts the performance of these language models."
Mohamed Amgad shared his point of view: "One of the issues will be differentiating between data used for training and validation. In the medical case, volume is the most important bottleneck and affects the generalization of trained models. As for validation (for example to understand the model's performance, or for regulatory purposes), quality is what we are lacking."
Ujwal Gadiraju talked about modern research topics: "It is a hard battle to find an optimal balance between quality and scale. If we consider the broader narrative of moving towards artificial general intelligence, different research communities have been trying to identify the next obstacles to tackle on that path. In those deliberations, the problems typically shift to quality-related aspects. To explain what I mean: some of the fundamental issues for ML algorithms these days pertain to their suboptimal grasp of what we call 'common sense', and this is often the source of failures in real-world settings. Models are trained on predetermined data, which can be biased and incomplete, and we are all familiar with the issues of covariate shift. Volume of data, however, will not always fix the problem, and we need to move towards understanding the fundamental issues related to quality. It is hard to find the balance, but I would say that at this moment scale is less of a problem thanks to recent advances in dealing with the lack of data."
Judging by Toloka's example, I believe that crowdsourcing platforms will allow us to scale data labeling without compromising quality.
I think the priority in Toloka tasks is quality, and over the several months I've been here that hasn't changed.
2. Inherent biases in datasets. They come from:
● Data samples
● Guidelines written by managers
● The personalities and backgrounds of the labelers
How can we overcome these?
Ivan told the audience that his team recently carried out experiments in which they gave Tolokers intentionally confusing, very poor instructions. They expected the bad guidelines to lead performers to do the task incorrectly, but the results turned out to be very robust, meaning the performers had somehow worked out on their own how to do the task correctly. This suggests that some performers already have the knowledge required for many kinds of simple tasks, gained through experience.
"I think the impact depends on whether it is a standard or non-standard task. Of course, if you give a simple task, such as labeling images with classes of physical objects, and provide confusing instructions, people will still be able to perform the task accurately. But what if we give them an unusual task?" said Zack.

In contrast to the previous case, Zack mentioned an unusual crowdsourcing task where the instructions clearly play a crucial role. In this experiment, on counterfactually augmented data, performers were asked to edit a text document so that it carried a counterfactual label different from the original one; only the changes required to 'flip' the applicable label were allowed. The authors iterated on the instructions many times, trying to make them clearer, and different sets of instructions produced completely different sets of results.
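To make the task concrete, here is a toy sketch of what a counterfactually augmented record for sentiment labels might look like. The example texts are invented for illustration, not taken from the actual study; the defining constraint is that the edit stays minimal while the label flips.

```python
# Each record pairs an original (text, label) with a minimally edited
# version whose label is flipped. (Illustrative examples only.)
pairs = [
    {
        "original": ("The plot was gripping from start to finish.", "positive"),
        "counterfactual": ("The plot was dull from start to finish.", "negative"),
    },
    {
        "original": ("Service was slow and the food arrived cold.", "negative"),
        "counterfactual": ("Service was quick and the food arrived hot.", "positive"),
    },
]

for pair in pairs:
    orig_text, orig_label = pair["original"]
    cf_text, cf_label = pair["counterfactual"]
    # The whole point of the task: the label must flip after the edit.
    assert orig_label != cf_label
    print(f"{orig_label} -> {cf_label}: {cf_text}")
```

Because the edits are so constrained, small wording changes in the instructions directly change which edits performers consider "allowed", which is one way to see why different instruction sets produced such different results.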

"What do we mean by bias? A statistical meaning would be a sort of systematic error that can be overcome by collecting more samples. There is also a societal meaning: thinking of bias in terms of what benefit or harm is done. All such results will be biased relative to each other in a statistical sense; thus the normative question appears: which of these results is the right one? There is no way to answer this question without knowing how the data was collected, what decision it was being used to drive, and what the impact is on people. This chain of reasoning expresses the societal concerns associated with data collection."
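The statistical side of this distinction can be illustrated with a short simulation (the numbers are arbitrary): random noise in individual annotations averages out as more samples arrive, while a systematic skew, say from a slanted guideline, survives any sample size. Only the first kind of error is fixable by scale alone.

```python
import random

random.seed(1)
TRUE_VALUE = 10.0

def mean_with_noise(n):
    """Random (sampling) error only: the estimate converges to the truth."""
    return sum(TRUE_VALUE + random.gauss(0, 5) for _ in range(n)) / n

def mean_with_offset(n):
    """Systematic error: a constant skew that no amount of data removes."""
    return sum(TRUE_VALUE + 2.0 + random.gauss(0, 5) for _ in range(n)) / n

for n in (100, 100_000):
    print(f"n={n:>7}: noisy estimate={mean_with_noise(n):.2f}, "
          f"skewed estimate={mean_with_offset(n):.2f}")
```

At n = 100,000 the noisy estimate sits essentially on the true value of 10, while the skewed estimate converges just as confidently to 12; the societal sense of bias behaves like the latter, which is why knowing how the data was collected matters.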
"I'd also like to focus not on the data-driven way of considering the bias problem but on a more top-down approach: we need to know what data we need. That is very dependent on our understanding of the problems themselves," adds Jie.

"One of our main findings from the surveys was that bias can be a huge problem with data if we are not able to characterize it. And it is very domain-dependent."

"The conceptual level is very confusing. It is very hard to write the kind of guidelines that get us the right data, or even get us the right people to contribute the right data. For example, we recently ran a survey on conflictual online language, such as hate speech and abusive language detection. English-speaking people may not share the same perspective, and we need to zoom in and pay attention to small local communities."
Olga agreed that this point really resonates with Toloka's practical experience. "One of the common tasks in the industry is content moderation for a company's platform. Such tasks can run at a very large scale. Overall, it becomes a huge responsibility for the manager who writes the guidelines. People in the industry may not even realize it, because the influence is leveraged at every stage: it goes from one small word in the instructions to an application used by millions of users. But transparency can help control these things: other people can see the guidelines and provide feedback to improve them."
Mohamed considered things from a practical perspective: "In domains where the task is ambiguous or not well defined, no matter how good the guidelines are, there will be some inherent variability. In addition, some annotators may not read the instructions carefully. These are two factors that should be taken into consideration. In my experience, every time I try to think through each scenario, anticipate problems, and any other techniques I can try, I still find it critical to follow up on the first set of tasks in a pilot study and use those results to learn about common pitfalls and update the annotation instructions before proceeding to the main project. This way you catch the things you haven't thought of and form a general idea of the inherent variability in the task."
3. The last and most challenging question concerns how to attract experts to data labeling tasks. How can we incentivize people to share real-life experience and expertise in certain areas?
Career/research incentives (recommendation letters, etc.) and financial incentives can persuade experts to spend their time on data labeling.

On this point, Grace jumped in: "Previously, my company realized the career-incentives approach by hiring a full-time data-labeling team that was part of the product development and design process. This way expertise can be built, and the performers feel more valued by the team."
However, Olga saw significant drawbacks in this approach:

● It's hard to scale and requires a significant search effort.

● At some point, such a fully dedicated team stops being a group of experts in their domain, as they spend all of their time on data labeling rather than on developing their other skills.
Mohamed suggested a solution to overcome this problem:

● Integrate data collection into specialists' daily practice. For example, when pathologists try to define a diagnosis, they naturally collect some of the data necessary for AI model training. In the coming decades, this will become a common mode of data collection in computational pathology and radiology.

● Assigning Continuing Medical Education (CME) credit to data labeling tasks. Physicians need a set number of CME credits to maintain certification, so assigning CME credit to data labeling helps align AI research with the career interests of practicing physicians.
"The last point is that we need to think not only about how to use experts, but also about how they can gain knowledge from the community, which can potentially be very attractive and increase expert involvement," said Jie.
For Novi, Toloka is not only a place where she can apply her design expertise in related tasks; it's also a platform for learning new skills, such as data analysis and crowdsourcing.
These solutions will surely be part of the coming AI era!