There's no shortage of startups worldwide that use artificial intelligence (AI) to solve problems. If you search for "this startup is using AI" in Google, you'll discover that startups use AI for HR, crime investigation, road safety, drug discovery, and many things you might not expect. But what does it mean to use AI in the startup world? And, perhaps more importantly, how do you use AI to solve a particular problem?
Most of the AI you’ve read about is driven by machine learning. This technology allows software to find patterns in large amounts of data. At Lumen5, we use machine learning to solve several problems, including:
- Image cropping. Intelligently adapting background media to multiple aspect ratios, ensuring the vital part of the media is kept in the frame.
- Language prediction. Detecting the language of your content, enabling right-to-left text for languages like Arabic and Hebrew.
- Text highlighting. Adding visual emphasis to important words and phrases in a scene.
Text highlights make important words and phrases stand out, but there's no obvious rule-based approach to choosing which words to emphasize.
We won’t dive into each of these systems’ details, but we’ll refer back to them when we talk about relevant parts of the machine learning process.
A Trip to the Zoo
In recent years, there’s been an explosion of open source code and pre-trained machine learning models. The Model Zoo, TensorFlow Hub, and PyTorch Hub host an ever-growing collection of models you can drop into your app to solve various problems. For example, if you want to build a face detector, you can grab a pre-trained object detector and fine-tune it on a face dataset. We did just that for our intelligent image cropper.
You’ll also find pre-trained models for tasks like image captioning, image classification, language detection, question answering, and a host of others.
If you can find a model that does exactly what you want, use it. If it does almost exactly what you want, adapt it to your needs by fine-tuning. For example, if you want to solve the Chihuahua or Blueberry Muffin problem, take an image classifier trained on the ImageNet dataset and fine-tune it with a dataset of chihuahua and blueberry muffin images. This is an example of transfer learning, which takes a model trained for one task and applies it to another.
Once you’ve picked a model to use for transfer learning, how you decide to leverage that model should be based on the similarity between your task and the task on which the selected model was trained. When you want to train a classifier to distinguish between chihuahuas and blueberry muffins, it will usually be enough to take a model trained on ImageNet, remove its output layer (which predicts 1,000 image categories), and replace it with a new output layer that predicts only two categories: chihuahua and blueberry muffin.
If your task is significantly different, you can still use transfer learning, but you’ll want to use the pre-trained model as a feature extractor and build a larger model on top of it. With very deep neural networks, the deeper the layer, the more task-specific that layer’s output. Early layers learn more generic features that can be repurposed for different downstream tasks.
The advantages of transfer learning
A massive advantage of transfer learning is that you often don’t need a sizeable task-specific training dataset because you’re letting the pre-trained model use what it’s already learned and limiting the scope of what it needs to re-learn for your task. The generic feature extraction layers of a model trained on ImageNet will have already learned to extract useful features from millions of images, so you might only need thousands of images in your task-specific training dataset.
Even though pre-trained models are widely available, and modern frameworks like TensorFlow and PyTorch make it easy to remove layers and build models on top of other models, this isn’t always the right approach. When you’re building features that make your product unique, you often won’t find what you’re looking for in a pre-trained model. Models and datasets for your task might appear eventually, but if you’re solving a genuinely novel problem today, you’ll need to train a model from scratch.
Three strategies to collect and clean data
When you’re trying to gather data to train a model, always remember that it’s “garbage in, garbage out.” If you can’t find good training data, don’t waste effort trying to train a model because it will hurt more than help your users. There are several strategies for gathering good training data.
1. Have users generate data for you inside your app
Human behaviour can often be an excellent teacher, especially when you’re trying to automate a task that your app also allows users to do manually. In Lumen5, a perfect example of this is text highlighting. Highlighting is a subjective task — what exactly makes a good highlight? Learning from highlights that users make in Lumen5 sounds like a good idea. In practice, though, it’s not quite that easy.
Our machine learning-based text highlighter is the third highlighter to make it into production. The first two versions were strictly rule-based systems that did syntactic pattern matching. When we sat down to assemble a dataset to train our model, we could have simply dumped all of the highlights from our database and trained a model on those. But, because many of those highlights were generated by our previous rule-based systems, it made no sense to use those examples to train a model. That model would only be learning to reconstruct the small set of hand-crafted rules that generated those highlights, and we’d be no better off than if we’d left one of those rule-based systems in place.
If you want to train a model that learns from your users’ behaviour, make sure you have a robust way to distinguish between user-generated examples and system-generated ones.
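A minimal sketch of that kind of filtering, assuming each stored highlight is tagged with a hypothetical `source` field recording which system created it:

```python
# Hypothetical records: each highlight is tagged with its origin.
highlights = [
    {"text": "grow your brand", "source": "user"},
    {"text": "the quick brown fox", "source": "rule_based_v1"},
    {"text": "save 20% today", "source": "user"},
    {"text": "lorem ipsum", "source": "rule_based_v2"},
]

# Keep only user-generated examples for training, so the model doesn't
# just re-learn the hand-crafted rules of the old systems.
training_examples = [h for h in highlights if h["source"] == "user"]
print(len(training_examples))  # 2
```

The exact tagging scheme doesn't matter; what matters is that provenance is recorded at write time, because it's nearly impossible to reconstruct after the fact.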
2. Find a third-party dataset
This is often an excellent choice, especially when you’re trying to automate a process for which you don’t have many good user-generated examples. Publicly available datasets are readily searchable. Even if you can’t find a dataset that explicitly models the behaviour you’re trying to automate with machine learning, you can sometimes find datasets with similar inputs and target outputs, even if those outputs serve a different purpose.
For example, if you want to learn a text highlighter but can’t find a dataset that is explicitly about text highlighting, you can probably still find a dataset that has strings as input and substrings as output. You can then filter that dataset so that the examples you train on most closely resemble the text highlights you want your model to learn.
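Such a filter can be as simple as a heuristic predicate over (string, substring) pairs. A toy sketch, with an assumed rule that good highlights are short contiguous substrings (the threshold and rule are illustrative, not our production logic):

```python
def looks_like_highlight(text: str, span: str) -> bool:
    """Heuristic filter: keep short, contiguous spans (assumed rule)."""
    return span in text and 1 <= len(span.split()) <= 5

# Toy (input string, output substring) pairs from a borrowed dataset.
dataset = [
    ("Machine learning powers modern video tools", "Machine learning"),
    ("A very long answer that spans most of the sentence here",
     "A very long answer that spans most of the sentence"),
    ("Save time with automated editing", "automated editing"),
]

# Keep only pairs that resemble the highlights we want to learn.
filtered = [(t, s) for t, s in dataset if looks_like_highlight(t, s)]
print(len(filtered))  # 2: the ten-word span is rejected
```

Even a crude filter like this can shift the training distribution much closer to the behaviour you actually want.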
Be mindful of licensing when you use third-party datasets because not all datasets are free for commercial use.
3. Build your own dataset
Building a custom dataset from scratch is often impractical. Although it gives you full control over the behaviour you want your model to learn, manually creating a large enough dataset of training examples takes a great many person-hours. If you feel you have to do this, leverage crowd-sourcing platforms like Amazon Mechanical Turk. Crowd-sourcing can introduce data quality problems, so be ready to do some aggressive filtering to remove low-quality examples from your dataset.
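One common filtering tactic is to collect multiple labels per example and keep only those where annotators agree. A small sketch with hypothetical crowd labels and a majority-vote threshold of two:

```python
from collections import Counter

# Hypothetical crowd labels: three workers label each image.
labels = {
    "img_001": ["chihuahua", "chihuahua", "muffin"],
    "img_002": ["muffin", "muffin", "muffin"],
    "img_003": ["chihuahua", "muffin", None],  # one worker skipped
}

def aggregate(votes, min_agreement=2):
    """Return the majority label, or None if agreement is too low."""
    counts = Counter(v for v in votes if v is not None)
    label, n = counts.most_common(1)[0]
    return label if n >= min_agreement else None

clean = {}
for image_id, votes in labels.items():
    label = aggregate(votes)
    if label is not None:
        clean[image_id] = label

print(clean)  # {'img_001': 'chihuahua', 'img_002': 'muffin'}
```

Examples with no clear consensus (like `img_003` above) are dropped rather than guessed at, trading dataset size for label quality.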
Want to work with the Lumen5 engineering team and solve problems using machine learning? We're hiring!