Why We Built an Open Source Computer Vision Toolkit
Over the last few years, the Machine Learning (ML) landscape has changed dramatically. The comeback of neural networks in the form of Deep Learning has opened new creative ways to approach many classical ML and Artificial Intelligence (AI) problems.
Out of all the areas vastly improved by Deep Learning techniques, computer vision has been particularly revolutionized by major breakthroughs that have only very recently occurred.
The Deep Learning revolution in computer vision
It all started when a new model published in a paper by Krizhevsky et al., known as AlexNet, won the ImageNet challenge in 2012. This challenge had been running for several years, and one part of it consists of an image classification task: algorithms are given images and must classify them into a fixed set of categories. The training set consists of 1.2 million images, each labeled with one of 1,000 categories. For the evaluation, the algorithm must decide the categories of images it has not previously seen during training.
This new AlexNet model reduced the error rate from 25% in 2011 to a mere 16%, a huge improvement over prior state-of-the-art techniques which, at the time, still relied on handcrafted features such as HOG or SIFT. Subsequent years continued the trend, and the error rate quickly dropped below 3%, often surpassing human performance in the narrow task of assigning a category to images. Deep Learning was here to stay, image classification was “solved” and this xkcd comic was obsolete.
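As a side note, the ImageNet figures above are top-5 error rates: a prediction counts as correct if the true label is among the model's five highest-scoring classes. A minimal, illustrative top-k error computation in plain NumPy (a toy sketch, not how the official evaluation server works) looks like this:

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k
    highest-scoring predicted classes."""
    # Indices of the k largest scores per sample (order within the
    # top k does not matter for the metric).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == labels[:, None], axis=1)
    return 1.0 - hits.mean()

# Toy batch: 2 samples, 3 classes.
scores = np.array([[0.10, 0.90, 0.00],    # sample 0: class 1 wins
                   [0.80, 0.15, 0.05]])   # sample 1: class 0 wins
labels = np.array([1, 2])
print(top_k_error(scores, labels, k=1))   # → 0.5 (sample 1 is missed)
```

With k=3 every label is trivially in the top 3 of a 3-class problem, so the error drops to 0, which is why top-5 on 1,000 classes is a meaningfully forgiving but still hard metric.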
What changed? It is interesting to note that, conceptually, AlexNet was almost equivalent to LeNet-5, a model proposed by Yann LeCun (now Director of AI Research at Facebook) almost 15 years earlier and used back then to recognize handwritten digits. However, LeNet-5 had roughly two orders of magnitude fewer parameters (AlexNet had around 60 million) and a much smaller dataset on which to train (MNIST). Advances in modern GPU technology, combined with the availability of large open datasets such as ImageNet, drove the algorithmic advancements that significantly improved accuracy in almost every area where Deep Learning has been applied.
In particular, both LeNet and AlexNet belong to a family of neural network architectures called Convolutional Neural Networks (CNNs or ConvNets). These networks have a number of properties that make them really well suited for image processing tasks. CNNs use a layered structure, in which each layer has the ability to slide different filters through the output of the previous layer. This allows the networks to learn patterns of increasing level of abstraction, until they can grasp complex concepts that would otherwise be hard to express. For example, it has been shown that lower layers learn simple features like edges and color gradients, while higher layers can learn more abstract features like eyes or faces.
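To make the “sliding filters” idea concrete, here is a minimal, illustrative single-filter convolution in plain NumPy. Real CNN layers stack many such filters and learn their weights during training; the hand-written edge filter below only stands in for the kind of pattern a first layer tends to learn:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a single filter over a 2-D image (valid padding, stride 1),
    producing a feature map of filter responses."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Each output value is the filter's response at one position.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A vertical-edge filter: responds strongly where pixels jump dark→bright.
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
print(conv2d(image, edge_filter))  # strong response (2.0) only at the 0→1 edge
```

Stacking such layers, with nonlinearities in between, is what lets deeper layers combine edge responses into the more abstract features mentioned above.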
On object detection in images
The power of CNNs has also been applied to object detection, dramatically improving results over previous methods. In object detection, we aim to identify a variable number of objects present in an image and, for each one, classify it into a fixed set of categories and draw a bounding box that encloses it. This is a much harder problem than image classification, and so far it has not been “solved”.
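One common way to represent a detector's output is a category label, a confidence score, and a box; and the standard measure for deciding whether a predicted box matches a ground-truth one is Intersection over Union (IoU). A small illustrative sketch (these names are ours for illustration, not Luminoth's actual API):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str    # predicted category, e.g. "car"
    score: float  # model confidence in [0, 1]
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels

def iou(box_a, box_b):
    """Intersection over Union: area of overlap divided by area of union.
    Commonly, a prediction counts as a match above some threshold (e.g. 0.5)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle (empty if the boxes do not overlap).
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = ((ax1 - ax0) * (ay1 - ay0)
             + (bx1 - bx0) * (by1 - by0)
             - inter)
    return inter / union if union else 0.0

pred = Detection("car", 0.92, (10, 10, 50, 50))
truth = (12, 12, 50, 50)
print(iou(pred.box, truth))  # → 0.9025, a good match
```

The “variable number of objects” part is what makes the problem hard: the network's output is a whole list of such detections, not a single class probability vector.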
This problem has plenty of applications: counting objects of any type (like people or cars), aerial image analysis, visual search engines, and health care, where it can help with the analysis of medical images. Did we also mention self-driving cars?
One could argue that, with the recent explosion of Deep Learning frameworks (Google’s TensorFlow, Facebook’s PyTorch and Caffe2, Microsoft’s CNTK, Amazon’s MXNet, among many others) and the tools built around them, it is fairly straightforward for most programmers to bring state-of-the-art image classification into their applications. However, object detection is a tougher nut to crack.
Object detection in the industry
Over the last couple of years, we have identified some issues most companies face when incorporating these technologies into their platforms.
Steep learning curve
Even for people with tons of experience (and even with PhDs in the field!), the space has been evolving so quickly that it’s very hard to keep up with the pace of new techniques.
Moreover, many implementations of object detection techniques come from the academic world, where they were conceived as research prototypes and not really intended to be used in production applications. This means that configuring and modifying them can be challenging.
While there are some high-level abstractions such as Keras, it is not always straightforward to apply them directly to the latest research findings: abstractions gain simplicity at the cost of limiting what you can customize.
SaaS solutions are not always applicable
Many large companies have developed their own SaaS API solutions, like Google’s Vision API, Microsoft’s Computer Vision API, and Amazon’s Rekognition, among others. These are super simple to integrate into most applications and sometimes offer great solutions. However, these APIs usually only work for a predefined set of tasks/classes/object types: detecting faces, cars, dogs, cats, etc. While these are good for a broad range of applications, it is very common for companies to have needs specific to the problem they are trying to solve. What if you are a manufacturing company and want to identify defective pieces in your production chain? Or if you work in health care, and want to detect certain patterns in medical imagery? In these cases, you need something that can be trained with any dataset.
There are services that even let you upload samples and do the training in the cloud. However, sometimes you want to own your data and model and deploy on your own servers, rather than have an important part of your core business tech hosted outside your platform by a Cloud API provider to whose model (or implementation) you have no real access.
Besides, depending on the volume of usage you expect, SaaS might just become too expensive, or the latency of transmitting images (which require high bandwidth) might be prohibitive for your application.
Creating a functional implementation from a research paper is damn hard
Even if you are an expert in one of the Deep Learning frameworks, coming up with something that actually works is pretty damn hard. Small implementation details have no room in academic papers, yet they can and do make a huge difference in the results. Moreover, many things can be implemented in several ways, and these choices might in turn also affect your results.
Bridging the gap: birth of Luminoth
After identifying these issues, and realizing that we at Tryolabs ended up rewriting much of the same TensorFlow boilerplate code and models over and over for our clients, we started to think about factoring out that code and about what an ideal toolkit would look like.
As a result, we built Luminoth, an open source toolkit for computer vision. Currently, we support object detection and image classification, but we are aiming for much more. It is built in Python, using TensorFlow and Sonnet, a very useful library built by DeepMind for building complex neural networks out of reusable components.
Luminoth’s main strengths
– State-of-the-art algorithms: currently we support image classification and the Faster R-CNN model for object detection, but we are actively working on providing more models and keeping up to date with the latest research in the field.
– Open source: Luminoth is free and open source. You can download it, customize it for your needs and integrate it into your product or cloud. You can also contribute to it!
– Developer friendly: we’ve poured our experience of working on bridging the gap between academic Machine Learning findings and production ready software to make Luminoth accessible. We strived for an easy to use interface, beautiful code with comments, and unit tests. Of course, there is still a lot of room for improvement, and things will get better as the toolkit gets more mature.
– Made with TensorFlow and Sonnet: production ready, reliable, robust and maintained frameworks.
– Customizable: you can train Computer Vision models with your own data. You are not limited to existing datasets such as COCO or ImageNet.
– Cloud integration: we strived for a super simple Google Cloud integration, specifically with ML Engine. This means distributed training is very straightforward: there is no need to buy those GPUs.
We expect Luminoth to be helpful both to developers wanting to integrate Deep Learning based computer vision algorithms into their products, as well as researchers who want to experiment with new models and techniques while skipping tedious boilerplate code.
Feel free to explore our GitHub repository. Feedback and contributions are more than welcome!
Alan is the CTO of Tryolabs. He holds a Computer Engineering degree and has 7+ years of experience developing robust backends, infrastructure and Machine Learning based algorithms. Currently, he is the main consultant for every project developed and an active member of the R&D team. He is a Python expert with a deep understanding of Machine Learning related technologies.