How to get crowd annotations with expert’s quality

The Data Scientist's Fairy Tale goes like this:

Once upon a time, in a village not far from here, lived a young girl who enjoyed looking at data. One day, she got hold of a unicorn-sized trove of magical data, and after a quick look she found a huge diamond inside. THE END!

A data scientist and her treasure trove of data.

The reality is that in the vast kingdom of data, thousands of trolls are cleaning each piece of dirt, attempting to reach that elusive diamond. You, the common data scientist, are a low-ranking but brilliant officer in a tiny village in this kingdom. You do not have the $$$ to hire an army of trolls, and you are convinced that you can dig much better yourself with your Spark / R / pandas tools. You know you will get that diamond on your own eventually!

OK, let’s unpack the analogy…

As a data scientist working on supervised machine learning, your first encounter with the messy real world may come after you have finally collected your first substantial dataset and realize that you can't just start using it — first, you need high-quality labels.


Depending on the size of the data and the complexity of the labels, the simplest option may be to just do it yourself – how bad can it be? Sure, if all you need to do is add binary labels to a few hundred images, you're likely to be done in less than a day of work. Not too bad. However, drawing polygons around objects in a thousand images will take significantly longer. And because you plan to use a very deep neural network for your new project, your training dataset had better include tens of thousands of data points, every one of which needs a label. If you still plan on doing it yourself, budget for a multi-week annotation effort: probably not the best use of your time. At this point, you're probably thinking – surely I can pay someone to do this!

And you’re right: There are several options available. The most basic one is to pipe your data into Mechanical Turk, using the crowdsourcing platform’s vast scale and affordable pricing to label large amounts of data quickly and relatively cheaply. But what about quality and accuracy? After all, asking “random” workers to perform micro-tasks for pennies might attract uneducated, unqualified, unmotivated workers.




After two years of developing data annotation services on top of Mechanical Turk, we know that these concerns are valid. However, with the right error-detection and error-correction mechanisms in place, Mechanical Turk data labeling quality can equal – and even exceed! – that of data scientists and world experts.


Reaching 90% labeling accuracy with Mechanical Turk: A real-world case study

One of our customers, a world-renowned academic research lab, developed a novel technique for visually classifying biological events. Applying it to thousands of events is extremely challenging: the visual task is both repetitive and physically straining.

The lab had the world’s best (and only) expert on classifying these events. Each experiment the lab ran generated thousands of events, which took weeks to manually classify. With multiple experiments running per month, this was not sustainable.

When we started working with the lab, we incorporated training for this labeling task into our platform. Following a short training and testing period, Mechanical Turk Workers were able to classify these events. Each event was assigned to multiple workers — up to 20 — and the platform labeled the event according to the majority vote.
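The majority-vote step can be sketched in a few lines. The label values and votes below are hypothetical, purely to illustrate the aggregation logic, not the platform's actual pipeline:

```python
from collections import Counter

def majority_vote(worker_labels):
    """Pick the label that the most workers assigned to a single event."""
    return Counter(worker_labels).most_common(1)[0][0]

# Hypothetical event: five workers submit their classifications
votes = ["mitosis", "mitosis", "apoptosis", "mitosis", "apoptosis"]
print(majority_vote(votes))  # prints "mitosis"
```

In practice, aggregation also needs a policy for ties and for low-confidence events (for example, routing them to additional workers), which this sketch omits.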

The results?

  • Mechanical Turk workers were accurate in 90% of cases, compared to a 93% accuracy rate for the expert.
  • However, in almost half of the cases where the workers disagreed with the expert, a thorough review revealed that the expert's original classification was wrong and the workers were right.
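Why can a group of non-experts match an expert? Under a simplifying assumption of independent worker errors on a binary task, the accuracy of a majority vote can be computed directly, and it climbs quickly with the number of workers. The 75% per-worker accuracy below is an illustrative number, not the lab's actual figure:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that a majority of n independent workers, each correct
    with probability p, produces the right label (n odd, binary labels)."""
    k_min = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# With p > 0.5, adding workers pushes the vote's accuracy toward 1.0
for n in (1, 5, 11, 19):
    print(n, round(majority_accuracy(0.75, n), 3))
```

This is the same intuition as error-correcting codes: redundancy turns many noisy signals into one reliable one.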

Accuracy isn't the only factor involved — cost and speed matter, too. By combining automated pipelines with the management of hundreds of workers, Clay Sciences can process a whole experiment in a couple of days, at a fraction of the cost of the expert's time.

It would be fair to say that Clay Sciences now has the world's second-best group of experts on classifying these events (the first being, of course, the expert at the lab).

We are constantly surprised by the quality of the annotations we get “out of the box,” even before we start applying error corrections.


Here’s another compelling example.
When asked to annotate lane markings for one of our customers, a large car manufacturer developing an autonomous vehicle platform, some workers went the extra mile (no pun intended) and annotated lane markings visible through the car's rear window.

How does it work?

Reading all this, you might be thinking you can use Mechanical Turk for data annotation and get the same level of accuracy “out of the box.” Unfortunately, this is likely not the case.

Here are some of the features we implemented on top of Mechanical Turk to achieve this level of accuracy:

  • We built robust error-detection and error-correction mechanisms. These allow us to detect and discard errors in the annotations. Much like error-correcting codes, which let you recover a clear signal from a noisy communications channel, we are able to recover clean annotations even when some workers make mistakes on some annotations.
  • We choose to employ experienced workers with good reputations. This starts us off at a higher baseline annotation quality.
  • We aggregate statistics on worker annotations and, over time, identify the best workers for each task. We also reject workers with a higher tendency to make annotation mistakes.
  • We created intuitive and efficient HTML5 annotation tools, which lead workers through the annotation steps with minimal overhead, fewer extra clicks, and fewer errors.
  • We added “gamified” tutorials where workers learn how to use the tools and have to reach a high score to proceed.
  • We broke down complex, multi-step tasks into simple, single-step tasks that can be more easily and accurately accomplished by workers. These tasks then get automatically re-assembled into the original task.

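As an illustration of the worker-statistics point above, tracking each worker's agreement with the consensus label might look like this. The class, method names, and thresholds are hypothetical; the real platform's logic is more involved:

```python
from collections import defaultdict

class WorkerStats:
    """Track how often each worker agrees with the consensus label."""

    def __init__(self):
        self.correct = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, worker_id, worker_label, consensus_label):
        """Log one annotation and whether it matched the majority vote."""
        self.total[worker_id] += 1
        if worker_label == consensus_label:
            self.correct[worker_id] += 1

    def accuracy(self, worker_id):
        n = self.total[worker_id]
        return self.correct[worker_id] / n if n else None

    def should_exclude(self, worker_id, min_tasks=50, min_accuracy=0.8):
        """Flag a worker only after enough tasks to judge them fairly."""
        n = self.total[worker_id]
        return n >= min_tasks and self.accuracy(worker_id) < min_accuracy
```

A design note on the sketch: excluding workers only after a minimum number of tasks avoids penalizing newcomers for a single early mistake.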

The bottom line: Data labeling on top of Mechanical Turk can be scalable, fast, and cost-effective. However, to ensure high-quality annotations, you will need to invest in quality-assurance measures (or use Clay Sciences :))



Copyright 2018 Clay Sciences
