A comprehensive guide to annotating videos for Machine Learning

Chapter 1, or how an exciting data science project turned into a full-time data labeling nightmare

A few years ago, while working at a data science company that analyzes Twitter data, I was tasked with building classifiers for detecting newsworthy objects in Twitter images. Such a project, for a data scientist, is amazing – interesting technology, challenging goals, lots of value, with all the machine learning buzzwords one can ask for. I was excited about this project, and immediately fired up a GPU instance.

After putting together some classifiers, it was time to see how they performed. One of the tasks on my to-do list was getting a training and testing dataset. I did not think much of it – after all, getting a labeled dataset with a few thousand images is not an engineering challenge, and definitely not a data science challenge. Easy!

A quick Google search revealed that… I couldn’t find any relevant dataset for training for my specific problem. Really? Can this be the case? After accepting that reality, I considered “almost relevant” datasets for training. This would be like training a state-of-the-art African lion classifier using images of house cats – it should work because they’re all the same, right?


Well... nope. So, maybe I could buy a dataset? It turned out that it was impossible to find a suitable dataset at a reasonable cost. Next option: hire a team of interns for a month, and have them annotate tens of thousands of images. This option involved a huge overhead: budget, buying equipment, recruiting and training a team, and then unwinding the whole operation a month later – this did not make sense. I spent some time testing Mechanical Turk, and learned that I would need to create the annotation tools myself, and then carefully monitor the quality of the results.

Meanwhile, I sourced relevant images and evaluated image annotation tools, thinking of creating my own dataset. Soon, I found myself working full-time on a few frustrating tasks: A) evaluating and testing multiple buggy open-source annotation tools, B) writing simple Python annotation tools, and C) manually annotating hundreds of images.

Each of these tasks turned into a mini-project by itself. My exciting, cool data science project had turned into months of debugging UI elements and managing large Excel files, with weeks of manually annotating images. I learned why data scientists say most of their time is spent on getting the data right.

Raise your hand if you or your team have gone through a similar learning experience…

Chapter 2, where we evaluate video annotation alternatives

Today, with the growing use of video data, a similar data science project is just as likely to use videos as images (you can read more about why annotating video is way superior to annotating images in our previous blog post). If I had to do a similar project today, using video data (building video classifiers and creating training datasets on videos), my experience would likely be even worse. Video has additional layers of complexity in annotating, training, and classifying. In fact, even playing videos is not trivial.

If you’re looking at annotating videos for an ML training dataset, what are your options? Here they are, from worst to best.

  1. Pick a sample from your videos, extract all the frames, and annotate them as images. We recommend you don’t do this, as you’re missing all the benefits inherent to the video format while incurring the cost of annotating a large number of images. Even if using a team of annotators, this approach is not efficient.
  2. Take some videos, get a video annotation tool, and make a personal effort over a few days to annotate them (as videos). Likely, this won’t work. Even one short video can take many hours to annotate.
  3. Use an available relevant dataset for training. Depending on your specific problem and how similar your data is to the available training dataset, this is a great shortcut to take. If you have this option, go for it.
  4. Pick a sample from your videos, get a video annotation tool, hire an in-house/remote team, and annotate them (as videos). This can work. Keep reading to learn about annotation tools.
  5. CaaS. Some vendors (including us, Clay Sciences) will take your videos, and return annotations – we call it Classification As A Service. Just like you don’t want to write your own email client, you may not want to write video annotation tools – it’s a much simpler transaction which lets a data scientist be a data scientist and focus on the important things.

Chapter 3, where we provide advice on choosing a video annotation tool

If you decided to use a video annotation tool, here are the important features to consider in the tools you evaluate:

  • Annotating key frames. There are 1800 frames per minute in a 30fps video, but subsequent frames are usually correlated: you don’t want to (and don’t have to) annotate each.and.every.frame.from.scratch. At a minimum, annotating key frames and interpolating between them is required.
  • Native video format. You don’t want to extract all the frames from the video to be able to annotate them – if your tool needs this step, it is a sure sign this tool is, in fact, annotating images and not videos.
  • Tracking and ML integration. Automated tracking of annotated objects (e.g., using optical flow) can save a lot of annotation time. The ability to use predictions from machine learning models for initial annotations (to be corrected by experts) is another time saver. Tools that utilize active learning, where the annotator is essentially teaching the ML model, can be especially useful (more on that in our upcoming blog post).
  • Consistent IDs. When there is more than one object annotated, objects should have consistent IDs for the duration of the video. This is helpful any time you want to track objects throughout a video, and becomes crucial if objects move in and out of the frame during the video.
  • Distributed annotation. Every minute of a video can take hours to annotate. The option to share the annotation workload among a team of workers is extremely beneficial. The setup and configuration process for each worker, if any, should be minimal.
  • Segmenting long videos. Another aspect of distributing a large workload is the ability to split long videos into shorter segments (each segment can be annotated by different workers), and then merge the resulting annotations, with consistent quality.
  • Multiple annotators. Compare and merge multiple annotations from multiple workers on the same video segment, to reduce annotation errors and improve quality.
  • Customized labels. Can you customize the annotation tools to add your own labels?
  • Customized attributes. Can you customize the annotation tools to add customized attributes to objects (e.g., traffic light color)?
  • Annotation types. Does it have what you need? Bounding boxes (BBOX), lines, circles, dots, 3D boxes.
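
To make the key-frame idea from the list above concrete, here is a minimal sketch of linear interpolation between annotated key frames. It is only an illustration under simple assumptions: the `Box` format (x, y, width, height in pixels) and the `interpolate_boxes` helper are hypothetical names made up for this example, not the API of any tool discussed in this post, and real tools may use tracking or smarter interpolation instead.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixels (hypothetical format for this sketch)."""
    x: float
    y: float
    w: float
    h: float


def lerp(a: float, b: float, t: float) -> float:
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + t * (b - a)


def interpolate_boxes(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill in a box for every frame between consecutive annotated key frames."""
    if not keyframes:
        return {}
    frames = sorted(keyframes)
    result = {frames[0]: keyframes[frames[0]]}
    for start, end in zip(frames, frames[1:]):
        a, b = keyframes[start], keyframes[end]
        # Frames strictly between the two key frames get interpolated boxes.
        for f in range(start + 1, end):
            t = (f - start) / (end - start)
            result[f] = Box(
                lerp(a.x, b.x, t),
                lerp(a.y, b.y, t),
                lerp(a.w, b.w, t),
                lerp(a.h, b.h, t),
            )
        # The key frame itself keeps the annotator's exact box.
        result[end] = b
    return result


# Two manual annotations, one second apart in a 30fps video.
keyframes = {0: Box(10, 20, 50, 40), 30: Box(40, 20, 50, 40)}
boxes = interpolate_boxes(keyframes)
print(len(boxes), "frames annotated from", len(keyframes), "key frames")
print(boxes[15])
```

In this toy example the annotator draws 2 boxes and gets 31 annotated frames. Linear interpolation only holds up while the object moves roughly in a straight line at constant speed, so in practice the annotator adds a key frame wherever the motion changes.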

We have reviewed some of the available video annotation tools, and here is what we’ve found: (heads up: Yes, we do think our platform – Clay Sciences – is the best. But the comparison is indeed objective.)

| Feature | VATIC | VOTT | ViTBAT | Scalabel | BeaverDam | Clay Sciences |
|---|---|---|---|---|---|---|
| Annotating key frames | YES | NO | YES | YES | YES | YES |
| Native video format | NO | YES | YES | NO | YES | YES |
| Tracking & ML integration | NO | Partial, but broken | NO | NO | NO | YES |
| Distributed annotation | YES + mturk | NO | NO | Partial | YES + mturk | YES |
| Segmenting long videos | YES | NO | NO | NO | NO | YES |
| Multiple annotators | NO | NO | NO | NO | NO | YES |
| Customized labels | YES | YES | NO | YES | YES | YES |
| Customized attributes | YES | NO | YES | YES | YES | YES |
| Annotation types | BBOX | BBOX, Square | BBOX, Point, Group | BBOX | BBOX | BBOX, 3D-Cuboid, Line, Point |

  • VATIC: An ancestor to many annotation tools. An older tool that no longer runs out of the box (due to non-backward-compatible dependencies), it is very versatile but can be a pain to manage, and you should expect to dive into the Python and JavaScript code – any bugs are yours to keep. Videos are converted into a sequence of JPEG images, and some workers have complained about very long loading times.
  • VOTT: Easy installation, and claims integration with FastRCNN and a form of active learning – but we couldn’t get this to work. In fact, even interpolation between key frames didn’t work well in our tests. Also, it does not have consistent IDs.
  • ViTBAT: Somewhat similar to VATIC, can run on native videos and supports point and group annotations. Very configurable (e.g., you can add temporal attributes), but somehow you can’t provide your own labels for the annotations. It does not have any means to share the workload among multiple workers.
  • Scalabel: Promising tool, but very hard to use and very buggy. For example, we repeatedly tried to annotate one object but ended up with dozens of unique IDs created in some frames.
  • BeaverDam: An impressive UC Berkeley tool, runs as a local Django server and can integrate with mturk. However, it’s not clear how to download annotations, and we encountered bugs that prevented normal operation “as is” (it didn’t take us long to find and fix these bugs, but this may suggest caution).
  • Clay Sciences: Our tool! Built from the ground up to annotate videos (and images), with limitless customization capabilities to handle any type and size of data. Can run on our platform with crowdsource workers, or run privately on-prem with your own experts.

About Clay Sciences
Clay Sciences accelerates the process of building machine learning models, providing a platform for data scientists for obtaining training data quickly, efficiently and at scale.
Our web-based annotation tools for video, images, and text can be used on our platform with crowdsource workers, or on-prem with your own in-house experts.
