We recently held a webinar with Dr. Vera Dvorak, Machine Learning Operations Manager at Yext. In her role at Yext, Dr. Dvorak oversees the data annotation teams, ensuring that the data science team has all the labeled data they need to complete their projects. The webinar is full of insight and helpful information (you can check it out here) but we’ve pulled out a few key takeaways for you.
“My role is to be a link between the Data Science team on one hand and our annotation teams on the other. When it comes to the data science team, I am responsible for responding to all their labeled data needs, like the data they need for both training and retraining language models. And this can be both for long-term projects…and also short-term projects where you just need a limited set of labeled data to test some idea and then maybe discard it or go further with it and add more labeled data. So that's one part of my work: to respond to all of that, to be able to brainstorm, and to supply what is needed.
“At the same time, I also need to manage annotators in the sense of making sure that they have enough data to label, that they understand the guidelines, that they know how to use the labeling tool, and also, very importantly, that their questions are answered very swiftly so they don't have any blockers which would prevent them from labeling correctly, efficiently, and at a steady pace.”
“Label Studio provides very clear annotator feedback loops which are very transparent. They allow people both to leave their own comments but also to mark things for escalation and for discussion. We use comments on three different levels…”
“[I like] the flexibility of the labeling interfaces themselves. You don't need to be a programmer to add or remove labels, to play with how things are. I'm very picky about how annotators see things. I want to make things very condensed, so even if something is imported and I'm not happy with how it looks, I can play with it. Also, based on the feedback I get from the annotators, I can make fonts larger or smaller, for example. This is very easy to do and I really appreciate that.”
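To make that flexibility concrete, here is a minimal sketch of a Label Studio labeling configuration for a simple text-labeling task. The label values, colors, and font-size tweak are hypothetical examples for illustration, not Yext's actual setup; the XML can be pasted into a project's Labeling Interface settings, or supplied programmatically as the commented-out lines assume.

```python
# A minimal, illustrative Label Studio labeling config for a text task.
# Label values, colors, and styling are hypothetical, not Yext's setup.
LABEL_CONFIG = """
<View style="font-size: 1.1em">
  <!-- Bumping the font size is the kind of small tweak annotators often ask for -->
  <Text name="text" value="$text"/>
  <Labels name="entities" toName="text">
    <Label value="Location" background="#1f77b4"/>
    <Label value="Business Name" background="#2ca02c"/>
  </Labels>
</View>
"""

# Paste the XML above into the project's Labeling Interface settings in the UI,
# or create a project with it via the label-studio-sdk (assuming the pre-1.0
# Client API):
# from label_studio_sdk import Client
# ls = Client(url="http://localhost:8080", api_key="YOUR_API_KEY")
# ls.start_project(title="NER demo", label_config=LABEL_CONFIG)
```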
“When people start they usually need some time to get used to a task, to maybe even research and go back to the guidelines. But then over time they should improve, and then we compare them [with the other annotators]. But it doesn't mean that the fastest annotator is always the best if they are not as precise, so you also need to take that into account. I always tell them it needs to be this balance between quantity and quality. You don't want somebody who is super slow and doesn't do much, but you also don't want someone who is super fast and then overlooks things. As you know, ‘garbage in, garbage out.’ I always say it's not lots of labeled data that helps you, it's lots of high-quality labeled data that really gets you somewhere.”
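As a back-of-the-envelope illustration of that quantity-versus-quality balance, the toy script below compares hypothetical annotators by weighting hourly throughput by an agreement score (for example, agreement with reviewed or gold labels). The numbers and the scoring formula are made up for the example and are not Yext's evaluation method.

```python
# Toy illustration of balancing annotation speed against quality.
# The figures and the "useful labels per hour" metric are hypothetical.
annotators = {
    # name: (labels completed per hour, agreement with reviewed/gold labels)
    "fast_but_sloppy": (120, 0.72),
    "slow_but_careful": (35, 0.97),
    "balanced": (80, 0.93),
}

for name, (per_hour, agreement) in annotators.items():
    # Count only labels that survive review as genuinely useful throughput.
    useful_per_hour = per_hour * agreement
    print(f"{name:17s} {per_hour:4d}/hr  agreement={agreement:.2f}  useful~{useful_per_hour:.0f}/hr")
```

On these made-up numbers, the fastest annotator still produces the most labels that survive review, but the gap to the balanced annotator is far smaller than the raw speed difference suggests, and the errors that do slip through can still hurt the model (“garbage in, garbage out”).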
“You always get comments asking for clarification, but I think the strategy that I apply could be compared to a funnel. At the beginning, there are many questions about different things, but one by one people start labeling and they mark things they want to discuss, things they don't understand - and it could be 50% of your data at the beginning, or even more - and you go over them, but then you see it's repetitive. There is a pattern. You narrow that down more and more, and in a good labeling project, when the task is clear, you see this funnel effect where there are fewer and fewer questions you have problems with, and in the end there are basically no questions…If that's the evolution you have, I would say that means your labeling is going well.”
This is just a small sample of what’s available in the whole webinar. You can watch the recording here.