Guest Blog

Deep Learning Infrastructure at Scale: An Overview

Many commercial applications of Deep Learning need to operate at large scale, typically in the form of serving deployed models to large numbers of customers. However, inference is only half of the battle. The other side of this problem, which we will address here, is how to scale ML model building and research efforts when training datasets grow very large and training must be done in a distributed fashion in the cloud. At Intuition Machines, we often need to deal with web-scale datasets of images and video, so an efficient, scalable, multi-user distributed training platform is essential.

This post encapsulates our recent review of the field, and is intended to serve as an introduction to the principles of training at scale along with a brief overview of the best available open source solutions for training your own networks in this manner.

WHAT ARE THE CHALLENGES WITH SCALING UP DEEP LEARNING?

As ML projects move from small scale research investigations to real world deployment, a large amount of infrastructure is required to support large scale inference, efficient distributed training, data ingest/transformation pipelines, versioning, reproducible experiments, analysis, and monitoring; creating and managing these support services and tools can eventually constitute much of the workload of ML engineers, researchers, and data scientists.

The portion of ML code in a real-world ML system is a lot smaller than the infrastructure needed for its support.

Source: Sculley et al.: “Hidden Technical Debt in Machine Learning Systems”

How can we enable these teams to focus on training the best models and delivering the best solutions without reinventing the wheel each time or getting bogged down in technical debt? (Sculley et al. discuss some other challenges in scaling ML systems, such as Glue Code, Pipeline Jungles, Reproducibility and Process Management Debt). We will take a look at some of the emerging infrastructural solutions to combat these issues below.

These days it is common to use cloud compute services such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP) for handling all kinds of workloads. On the inference and data processing side, we can efficiently handle massive jobs in parallel using clusters of worker nodes and services co-ordinated with Kubernetes. We can be agnostic to underlying hardware and deploy on different platforms, using containerization to manage installation requirements.

For training, Deep Learning packages such as TensorFlow and PyTorch have added native support for multi-GPU training. This may be straightforward enough to deploy on a small scale on dedicated hardware, but multi-node distributed training requires additional work. Another big requirement is portability: the same code should run on a laptop, a GPU rig, or a cluster, and even scale automatically based on demand. It is possible to hand-craft scripts to orchestrate deployment across a cluster, but this is cumbersome. However, there are now open-source solutions that enable packaging, containerization, and automated deployment and co-ordination of workers and jobs. We will look at some in the next section.

Beyond distributed training of a single model, there are other requirements to make a truly efficient, repeatable, and flexible pipelined system for the whole ML workflow. When researchers are tackling new problems and investigating new models, it pays to have infrastructure in place that supports some of these tasks in order to streamline the process:

Data ingest, serving, versioning, preprocessing, splitting, transforms / augmentations
Job packaging and deployment; job reproducibility
Experiment monitoring/logging; comparison of results across runs, status dashboards
Automated hyperparameter tuning and model architecture search
Rapid deployment from training to online model serving / inference, and A/B testing
Support for a wide variety of Deep Learning frameworks and other libraries in an integrated fashion, so that researchers are not limited in the solutions they can try

Overall, the goal is to iterate and search over a variety of methods, models, features, and data pipelines in a manageable, repeatable, and measurable way, rather than trusting the more ad hoc approaches that are common to many small research labs. Ideally, once the processes for managing these components are set up, it becomes straightforward for researchers and data scientists to use the deployed environments in a self-service fashion, and to work collaboratively while sharing a common resource pool of cluster nodes.

Some core architectural components of the ML workflow.

There are a large number of commercial solutions being touted to manage the ML lifecycle, such as RiseML, H2O Driverless AI, Anaconda Enterprise, Databricks, Paperspace, Domino Data Lab, ParallelM, neptune.ml, and comet.ml. Many of these platforms are also built around Kubernetes. There are also useful hosted platforms available on the major public cloud services, such as Microsoft’s Azure Machine Learning Service (formerly BatchAI), Amazon’s SageMaker, and Google’s Cloud AI, which help with some of these requirements.

Many large organizations have also developed their own private internal solutions to these problems, for example FBLearner Flow (Facebook), Michelangelo (Uber), and TFX (Google).

However, we are big fans of open source (having open sourced portions of our hCaptcha system and HUMAN protocol) and contribute to dozens of open source projects that we use internally. Therefore rather than using a hosted commercial service, or developing our own from scratch, our preference was to see what open source equivalent solutions are emerging in the ML infrastructure space. There are a variety of requirements here which may be handled by a single package, or a combination of several, which we will describe next.

SOME OPEN SOURCE ML INFRASTRUCTURE SOLUTIONS

JupyterHub

The Jupyter notebook paradigm has become popular among researchers and data scientists for rapidly iterating (especially Python) code, visualizing results, and experimenting with different models. JupyterHub allows hosting multiple users with their own notebook servers; it is designed to be flexible, scalable and portable, and can be deployed with Kubernetes.

Kubeflow

A project that originated at Google, Kubeflow began with the aim of making it easy to run TensorFlow jobs on Kubernetes, and is expanding to support a whole stack of other ML and infrastructure tools. These include:

TFJob – a resource/YAML file spec describing how to run a (possibly distributed) TensorFlow job on Kubernetes
Argo – Build container-native workflows on Kubernetes (every step of the workflow is a container). For example, build reusable containers automatically from trained models.
Seldon-core – model deployment via Docker; convert to microservices with REST/gRPC APIs; A/B testing; and manage deployed inference scalably.
TF Serving – another library serving models for inference, A/B testing, model versioning
Istio – metrics, A/B testing and integration with TF serving
Katib – black-box hyperparameter tuning, in the vein of Google Vizier
Ksonnet – an alternative to Helm for Kubernetes package management

At the Deep Learning framework level, Kubeflow has been adding support for MXNet, PyTorch and Chainer as well as TensorFlow. JupyterHub is also available as a component.

There are clearly a large number of separate components, many of which are really standalone projects, with some overlaps and some gaps. The Kubeflow project, while more or less usable at the time of writing, is still under heavy development. Making the most of everything in a truly integrated fashion, and supporting multiple users dynamically, requires a large investment of engineering time: understanding the system, configuring it for your needs, and often patching areas that you need but that no one has touched recently.

Our view is that there is a lot of vision here and hopefully the pieces will come together rapidly as the project evolves. However, for larger organizations with distributed systems expertise it brings less benefit than you might expect due to the immaturity of the codebase and heavy churn. Without having a Kubeflow maintainer on staff, using it for non-trivial work may be difficult to justify and will certainly require a substantial time investment.

Polyaxon

Polyaxon is another integrated platform built on top of Kubernetes, with an emphasis on reproducible ML at scale in a simple and accessible way. It is somewhat more aligned with the goals of a self-service multi-user system, taking care of scheduling and managing jobs in order to make the best use of available cluster resources. It also handles code/model versioning and automatic creation and deployment of Docker images, and can support auto-scaling.

It provides features such as distributed model training, experiments, dashboards with visualization, metrics, and easy hyperparameter optimization. Polyaxon supports various DL frameworks including TensorFlow, Keras, MXNet, Caffe, and PyTorch. Currently there is no built-in support for model serving, as the emphasis is more on experimentation and training, but new features are being added rapidly.

Compared to Kubeflow, Polyaxon is very focused. It doesn’t attempt to cover real-time serving or Istio integration, has no Ksonnet requirement, and isn’t tightly tied to Seldon. Rather, Polyaxon is more focused on running ad hoc or repeatable experiments. It does handle all of the package management and can automate notebook and TensorBoard installs. Polyaxon solves the problem of running disparate jobs on the same cluster, and it supports distributed training as well. It has a built-in job management dashboard, as well as simple user management.

Polyaxon sits on top of Kubernetes services to provide an easy-to-use, repeatable ML experimentation and training platform, supporting dashboards, versioning and much more.

Source: https://docs.polyaxon.com/experimentation/concepts/

FfDL

IBM’s Fabric for Deep Learning (pronounced “fiddle”) is a Kubernetes-based platform designed around elastic, cloud-based distributed training, abstracting some of the complexities of the underlying infrastructure to move towards “Deep Learning as a service”. It supports many of the main DL frameworks (TensorFlow, Keras, Caffe 1 and 2, and Torch), makes use of Jupyter, Docker and Helm, and is designed around S3-based data and model storage. It provides a Grafana dashboard for monitoring jobs, and can use Horovod for distributed training. Manifest files are used to describe a model to be trained, hyperparameters, resource requirements, metrics, etc. REST APIs are used to interface with a gRPC Trainer, which stores jobs in a database and handles scheduling multi-user training jobs across their life cycles, as well as parameter servers. There is a training data service to manage metrics and logs. For model deployment, integration with Seldon is supported.

Architecture of FfDL

Source: https://github.com/IBM/FfDL

Lore

This Python framework was recently open sourced by Instacart, and is more of an easy-to-use workflow solution for the ML training and inference pipeline. It supports hyperparameter search, data extraction, encoding and splitting pipelines, database connections, dependency management, CI/CD testing, model deployment, and integration with packages such as Keras, TensorBoard, scikit-learn and Jupyter. One key idea is to standardize ML practice across different libraries with a set of wrappers. There is perhaps more of an emphasis on time series and textual data than on video, images, and CNN-based Deep Learning, and it doesn’t get into the world of containerization and cluster management.

As such, it is perhaps a good entry point for newcomers to the field who look for some of the consistency and workflow solutions we have been describing, but it perhaps lacks some of the more advanced features that are useful for an organization doing large scale distributed training and architecture search on visual domain data.

MLflow

An open source library created by Databricks, designed for managing three parts of the ML lifecycle: Tracking, Projects, and Models. The goals of the project are to improve reproducibility, enable experiment tracking, and make it easier to move models to production. It is also designed to support any Python or R based ML framework, has a web UI to make tracking straightforward, and has multi-user support by running a tracking server. Tracking supports metrics, visualizations, storing artifacts, and comparing runs, as well as hyperparameter tuning. Projects support a simple way of packaging code, run parameters, and dependencies via YAML configuration files and Conda environments, and integration with Git. Projects also support scalable dataset I/O with hosts like AWS S3. Models can easily be deployed in several ways, including as a REST server, or exported into a Docker container.

Overall, it is easy to get started with a single-user local setup, but still has scalability to support larger datasets and teams. Distributed training or cluster support is not part of the functionality however, so additional work will be required here.

StudioML

This Python package from Sentient is designed to manage ML model building experiments, covering some of the same ideas as MLflow or Lore, without deployment, but with some cloud-based training integration. It supports the most common packages – Keras, TensorFlow, PyTorch, and scikit-learn. Its main features include environment and dependency management, monitoring/logging, experiment management, hyperparameter search, artifact management, and containerization. One of its unique features is that it has direct integration with Amazon EC2 and Google Cloud APIs, as well as auto-scaling, to make launching its training jobs on these platforms straightforward.

Bighead

Airbnb recently presented their new open-source end-to-end ML pipeline library. As of the time of writing it has not yet been released, but it looks very promising. It covers features across the whole workflow, from data ingest and transformations through training, model management, hyperparameter tuning, and monitoring to deployment. It is designed to be modular and to support common ML training frameworks. There is support for built-in data transformations and data visualization tools, to help with one of the parts of the ML development process where researchers and engineers spend most of their time. Its Redspot environment is an extension of JupyterHub that helps with instance and package management, and is built around automated Docker container services, which helps with the training workflow and multi-user environments. Model management keeps model code and trained models in sync, and improves reproducibility through common metrics. Production supports scalable streaming and brings consistency between training and deployment environments.

There is a lot of promise here, though we will have to see how everything behaves in practice once released.

Data

On the data ingest, processing, management, and retrieval fronts, there is a lot more than can be covered in detail here, in terms of how to effectively deal with training datasets once their scale outgrows what can be stored on a single node. For now we are making use of custom solutions in these areas, but there are a couple of packages worth pointing out.

Pachyderm

The ability to transform data using different pipelines is an essential part of the ML process. In experimentation, data engineering is often as important as model design, if not more so. As models and data evolve in parallel, many organizations face the problem of repeatability and versioning of datasets along with models. But it’s not possible just to store huge datasets in Git. Packages such as Pachyderm therefore enable data provenance, i.e. tracking where the data came from and how it has been modified along the way. As data is ingested from different locations, cleaned, augmented, and transformed with different methods, version control can be applied, so that experiments can reveal whether it was changes to the dataset or to a model that caused an improvement. Some of this may be seen as overkill in certain workflows, but with data being a key part of the whole ML process, it’s worth thinking about how to integrate these approaches with experiment tracking, for instance. Pachyderm is also built around Kubernetes and can be integrated with Kubeflow.

Pachyderm allows data versioning and persisted model workflows.

Source: http://docs.pachyderm.io/en/latest/cookbook/ml.html
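The idea of data provenance can be illustrated with a few lines of Python. This is only a minimal sketch of content-addressed versioning with parent links; it is not Pachyderm's API, and the `commit` and `lineage` helpers, the step labels, and the toy byte strings are all hypothetical.

```python
import hashlib
import json
from typing import Optional


def commit(data: bytes, transform: str, parent: Optional[str]) -> dict:
    """Record one dataset version: a content hash plus how it was derived."""
    record = {
        "data_sha256": hashlib.sha256(data).hexdigest(),
        "transform": transform,  # hypothetical label for the pipeline step
        "parent": parent,        # id of the version this one was derived from
    }
    # The id covers both content and lineage, so identical bytes reached
    # through a different pipeline still get a different id.
    record["id"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    return record


def lineage(version_id: Optional[str], store: dict) -> list:
    """Walk the parent links back to the raw ingest, oldest step first."""
    steps = []
    while version_id is not None:
        record = store[version_id]
        steps.append(record["transform"])
        version_id = record["parent"]
    return steps[::-1]


store = {}
raw = commit(b"img1,img2,img3", "ingest", None)
store[raw["id"]] = raw
clean = commit(b"img1,img3", "drop_corrupt", raw["id"])
store[clean["id"]] = clean
aug = commit(b"img1,img1_flipped,img3,img3_flipped", "augment_flip", clean["id"])
store[aug["id"]] = aug

print(lineage(aug["id"], store))
```

Recording the dataset version id alongside each experiment run is one way to tell afterwards whether an improvement came from the data or from the model.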

Solt

A library for image data augmentation. Applying various parametric transformations to an image dataset is often necessary in Computer Vision tasks, and it can become a bottleneck in training. Being able to do this in a fast and streamable manner is a useful component.
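To make "parametric" and "streamable" concrete, here is a small pure-Python sketch. It does not use Solt's API; the function names, the flip probability, and the brightness range are assumptions chosen for illustration, and images are represented as plain nested lists of grayscale pixels.

```python
import random


def hflip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def adjust_brightness(image, delta):
    """Add delta to every pixel, clamped to the 0-255 range."""
    return [[min(255, max(0, p + delta)) for p in row] for row in image]


def augment_stream(images, flip_prob=0.5, max_delta=20, seed=0):
    """Lazily yield one randomly transformed copy of each input image."""
    rng = random.Random(seed)
    for image in images:
        if rng.random() < flip_prob:
            image = hflip(image)
        yield adjust_brightness(image, rng.randint(-max_delta, max_delta))


batch = [[[0, 128, 255], [10, 20, 30]]]
for augmented in augment_stream(batch):
    print(augmented)
```

Because `augment_stream` is a generator, transformed images are produced one at a time as the training loop asks for them, rather than materializing an augmented copy of the whole dataset.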

OpenFaaS

In keeping with the trend of using Kubernetes and Docker based microservices, the idea of Functions as a Service is a scalable way to create customized data processing pipelines from existing functional components, in a Serverless fashion. We have been investigating these approaches and integrating with other systems; this appears to be a flexible and robust way to handle large scale data processing requirements for both training and inference.
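As a rough illustration of what one such functional component might look like, the sketch below follows the `handle(req)` entry-point shape used by OpenFaaS's classic Python template, where the request body arrives as a string. The payload fields (`width`, `height`, `max_side`) and the resize logic are hypothetical, not part of our pipeline.

```python
import json


def handle(req):
    """Entry point: take the request body as a string, return the response body."""
    payload = json.loads(req)
    width, height = payload["width"], payload["height"]
    # Hypothetical preprocessing step: compute a target size that fits within
    # max_side while preserving the aspect ratio (never upscaling).
    max_side = payload.get("max_side", 256)
    scale = min(1.0, max_side / max(width, height))
    return json.dumps({
        "width": round(width * scale),
        "height": round(height * scale),
    })


print(handle('{"width": 1920, "height": 1080}'))
```

Each step of a data pipeline can be packaged this way as a small stateless function, which the platform can then scale independently of the others.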

Distributed training

The solutions described so far help with managing the training workflow. However, taking advantage of the additional compute power that a cluster provides, in order to train faster, requires some changes to the training code so that weight updates in stochastic gradient descent can be transferred between workers. There are a couple of solutions that avoid having to hand-roll some of this process.

Distributed TensorFlow has capabilities for using a parameter server to manage distributed jobs. This requires specification of a ClusterSpec, containing the IP addresses and ports of the servers where the tasks should run. Projects like Kubeflow (via TFJob files) or Polyaxon can help set this up on top of a Kubernetes cluster. This still requires a degree of interaction with the cluster setup itself; some additional scripting is required, for example, to automate packaging.
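A cluster description of this kind is just a mapping from job names to lists of `host:port` addresses, and TensorFlow can read it in JSON form from the `TF_CONFIG` environment variable. The sketch below only builds that JSON with the standard library; the `make_tf_config` helper and the host names are hypothetical.

```python
import json


def make_tf_config(workers, ps, task_type, task_index):
    """Build the TF_CONFIG value for one task in a parameter-server cluster."""
    cluster = {"worker": list(workers), "ps": list(ps)}
    if task_index >= len(cluster[task_type]):
        raise ValueError("task_index is out of range for this task type")
    return json.dumps({
        "cluster": cluster,
        "task": {"type": task_type, "index": task_index},
    })


# Hypothetical host names; on Kubernetes these would be pod or service addresses.
workers = ["worker-0.example:2222", "worker-1.example:2222"]
ps = ["ps-0.example:2222"]

# Every process receives the same cluster description but its own task entry.
for index in range(len(workers)):
    print(make_tf_config(workers, ps, "worker", index))
```

Generating and injecting this value for every pod is the kind of bookkeeping that tools such as TFJob take over.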

Horovod

While the approach used by TensorFlow is typically to use a parameter server in order to manage weight updates in a distributed training scenario, Uber’s Horovod library takes things to another level of performance and ease of use. It uses a different method for averaging gradients across workers. Rather than all gradients being averaged through a central parameter server, workers exchange only a portion of their gradient updates with their neighbors using a ring-allreduce algorithm, which makes much more efficient use of network bandwidth. It uses an implementation of the Message Passing Interface (MPI) such as Open MPI, together with NVIDIA’s NCCL library, to optimize these transfers both on a single multi-GPU machine and on a cluster.

Horovod uses the ring-allreduce algorithm to enable efficient distributed training without the need for a centralized parameter server

Source: https://eng.uber.com/horovod/
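The ring-allreduce pattern can be simulated in a single process to see how the data moves. This is a teaching sketch rather than Horovod's implementation: workers are plain Python lists, "sending" is a list copy, and the gradient length is assumed to divide evenly by the number of workers.

```python
def ring_allreduce(worker_grads):
    """Sum equal-length gradient lists across workers arranged in a ring."""
    n = len(worker_grads)
    size = len(worker_grads[0])
    if size % n != 0:
        raise ValueError("this sketch assumes the length divides evenly by n")
    k = size // n
    # bufs[r][c] is worker r's copy of chunk c.
    bufs = [[list(g[c * k:(c + 1) * k]) for c in range(n)] for g in worker_grads]

    # Phase 1 (scatter-reduce): in each step every worker passes one chunk to
    # its right-hand neighbor, which adds it to its own copy. After n - 1
    # steps, worker r holds the complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        messages = [((r - step) % n, list(bufs[r][(r - step) % n]))
                    for r in range(n)]
        for r, (c, data) in enumerate(messages):
            dst = (r + 1) % n
            bufs[dst][c] = [a + b for a, b in zip(bufs[dst][c], data)]

    # Phase 2 (allgather): the completed chunks travel once around the ring,
    # overwriting the partial sums held by the other workers.
    for step in range(n - 1):
        messages = [((r + 1 - step) % n, list(bufs[r][(r + 1 - step) % n]))
                    for r in range(n)]
        for r, (c, data) in enumerate(messages):
            bufs[(r + 1) % n][c] = data

    return [[x for chunk in buf for x in chunk] for buf in bufs]


grads = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
summed = ring_allreduce(grads)
averaged = [[x / len(grads) for x in g] for g in summed]
print(summed[0])
```

Each worker sends 2(n - 1) chunks of size 1/n of the gradient, so the traffic per worker stays below twice the gradient size however many workers join the ring, whereas a central parameter server has to receive every worker's full gradient.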

In terms of ease of use, there are only a few commands to inject into standard training code in order to support training in this way – standard optimizers are wrapped in a custom distributed version. Support was originally available for TensorFlow and Keras, but PyTorch training can now be distributed in the same way.
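For PyTorch, the injected calls look roughly like the sketch below. The Horovod calls shown (`hvd.init`, `hvd.local_rank`, `hvd.DistributedOptimizer`, `hvd.broadcast_parameters`) are the library's documented entry points; the `train_with_horovod` wrapper and its arguments are hypothetical scaffolding, and the imports sit inside the function only so the sketch loads on a machine without Horovod installed.

```python
def train_with_horovod(model, make_optimizer, batches, loss_fn):
    """Train `model` on this worker's share of the data using Horovod."""
    # Imported lazily: running this function requires PyTorch and Horovod.
    import torch
    import horovod.torch as hvd

    hvd.init()  # start Horovod and discover the other workers
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())  # pin one GPU per process

    optimizer = make_optimizer(model.parameters())
    # Wrap the ordinary optimizer so gradients are allreduced before each step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    # Make every worker start from the same initial weights.
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)

    for inputs, targets in batches:
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        optimizer.step()
    return model
```

A script built around this would typically be started with Horovod's launcher, for example `horovodrun -np 4 python train.py`, with each process reading a different shard of the training data.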

Horovod has already proved itself in providing an effective enabling technology, and we would be excited to see it be natively supported by more of the ML lifecycle solutions.

Our use

At Intuition Machines, we are not strangers to the challenges of Machine Learning (ML) at scale. We have become experts at providing very large scale, efficient inference and data retrieval solutions to our clients, with a focus on Computer Vision, Images and Video.

For these tasks, we currently bootstrap our Kubernetes cluster in Azure, and we use a combination of GPU nodes and virtual nodes to scale our laboratory cluster. We are integrating Polyaxon, with basic single sign-on, to manage our experiments and give users easy access to the cluster. This provides a simple mechanism for users to start with a Jupyter notebook and quickly transition it to distributed, large scale jobs, complete with experiment tracking and hyperparameter searches.

On the data side, we considered Pachyderm, which is more of a Hadoop rebuilt from scratch, and comes with some of the same downsides. Although Pachyderm is focused on “the data”, it has its own mechanisms for processing this data. We already have a preprocessing pipeline, and we also use OpenFaaS and an in-house system called Mongoose to process our large datasets resident on S3; integrating Pachyderm’s processing system so far didn’t buy us a lot.

Summary

Many of these packages, libraries and solutions have appeared during the last year, and most are still in Alpha or Beta stages. We expect to see them develop rapidly over the coming months. However, many are already in a state where they are useful for improving the real world workflows that many ML teams are facing. Larger organizations with more sophisticated requirements (especially on the data ingest side) will most likely benefit most from picking individual pieces to integrate into their preferred workflows rather than adopting opinionated and early-stage bundled solutions like Kubeflow.

Broadly speaking, we see that some of these solutions are designed with cloud-based training in mind while others are focused more on making the end-to-end ML workflow simpler for a single user system or a small research team. We are particularly excited to use solutions based around Kubernetes for the ability to make dynamic use of cloud resources, and efficiently deploy distributed training on a large scale.

However, smaller teams without in-house DevOps talent may be best served by choosing one of the simpler hosted solutions like Azure Machine Learning Service or Google Cloud ML Engine at the moment. Larger shops with more complicated needs are likely to find that the complete open source systems available are still very much in progress, and so picking and choosing components that add value on their own to plug into existing internal infrastructure will often be the most practical route for the near future.

Forecasting Consumer Healthcare Journey with Recurrent Neural Networks

Use of artificial neural networks for machine learning has enabled major advancements in intelligent systems, helping millions of people in their daily lives. During the past decade, progress has greatly accelerated thanks to the availability of massive amounts of data and use of specialized hardware to build deeper networks and perform faster optimization. Furthermore, better insight into the inner workings of deep neural networks has enabled both researchers and practitioners to achieve improvements in training and generalization (Erhan, 2010; Ioffe, 2015; Srivastava, 2014).

There are numerous environments where systems powered by artificial neural networks shape our experiences and influence our behavior. These systems routinely manifest in our experiences with e-commerce, web search, as well as in communication interfaces such as smart speakers, messaging, and email applications.

An important area where the use of machine learning is still in its infancy is population health. While deep learning has been used for medical diagnosis applications (Poplin, 2018; Cruz-Roa, 2014), building predictive models for behavior of healthcare consumers is a relatively unexplored subject. This is a potential use case that we are passionate about at Accolade.

Accolade at a Glance:

Our mission at Accolade is to provide personalized health and benefits solutions to improve the experience, outcomes, and cost of healthcare for employers, health plans, and health plan members. We provide a single point of contact for all health and benefits resources and work with employees and their families to help them utilize the best care options available. Our ability to be proactive about consumer behavior has always been crucial to our mission.

Predicting Members’ Healthcare Usage with Deep Learning

People pursue and obtain healthcare through various channels. For instance, they can visit primary care physicians or specialists, and they may receive care at clinics or hospitals and fill prescriptions at drugstores. However, while they often seek information to help in their decision-making from the internet, friends, and providers, choosing the right healthcare and using it properly has become an increasingly challenging and complex task.

In addition to these conventional methods, Accolade members can call our team of healthcare assistants or reach out to them through direct messaging. Furthermore, our technology enables informing our health assistants about changes in members’ health status that may require support and guidance. We consider all these as other forms of interaction between our members and the healthcare system.

By drawing on what we know about how our members use healthcare and related benefits, we have considered building models to predict members’ future usage patterns. This provides our team of health assistants with valuable insight to use in outreach and guidance. Such targeted interventions improve members’ health outcomes and their decision-making about using health and benefit resources, which in turn saves medical costs.

When it comes to learning from our members’ experience over time, events are not isolated from each other. Occurrence of a healthcare event can generally be traced back to a prior event. Let’s make this concrete with the following hypothetical scenario. Fig. 1a) shows a series of events that an Accolade member might experience over time.

Figure 1 a) Sequence of a member’s health events over time. b) An LSTM network learning from the sequence of events in a). {y_i} are the labels corresponding to the events whose feature vectors are {x_i}.

Here, the member visited a primary care physician (event #1), who referred him/her to a specialist (event #2). For the purpose of diagnosis, the specialist then asked the member to take medical tests (event #4). However, in the meantime, the member decided to consult his/her dedicated health specialist at Accolade (event #3). The member then returned to the specialist to discuss the results (event #5). Other events may follow.

Clearly, most of these events are the result of other events that happened earlier in the member’s timeline. For example, the lab visit was requested by the specialist, to whom the member was referred because he/she visited a primary care physician in the first place.

Furthermore, there is data that describes the context of each event. For example, there are diagnosis codes in specialist claims or lab visits, and procedure codes associated with operations or tests performed on members in medical facilities. Combined with member attributes (age, gender, family information, location, employer, etc.), these form comprehensive feature vectors {x_i, i = 1, …} describing individual members and the events they experience as they navigate through the healthcare system.
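As a toy illustration of how such a sequence of feature vectors might be assembled, the sketch below encodes the hypothetical journey from Fig. 1a. The event-type vocabulary, the choice of features, the day gaps, and the scaling constants are all assumptions made for the example, not the features used in our models.

```python
EVENT_TYPES = ["primary_care", "specialist", "accolade_contact", "lab_test"]


def encode_event(event_type, days_since_previous, member_age):
    """Turn one event plus member attributes into a flat feature vector."""
    one_hot = [1.0 if event_type == t else 0.0 for t in EVENT_TYPES]
    # Crude scaling to keep features in a similar range (illustrative only).
    return one_hot + [days_since_previous / 365.0, member_age / 100.0]


# Events #1 to #5 of the journey: (event type, days since the previous event).
journey = [
    ("primary_care", 0),
    ("specialist", 14),
    ("accolade_contact", 3),
    ("lab_test", 4),
    ("specialist", 7),
]
sequence = [encode_event(kind, gap, member_age=42) for kind, gap in journey]
print(len(sequence), len(sequence[0]))
```

In practice diagnosis and procedure codes would add many more dimensions, but the shape stays the same: one vector per event, in chronological order.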

Having identified event sequences and feature vectors describing each event, we use recurrent neural networks, Fig. 1b), to learn the underlying trends in the members’ healthcare journey. This enables us to make informed predictions about what is likely to come next in the members’ interaction with us or the healthcare providers.

Recurrent Neural Networks

Recurrent neural networks (RNNs) are at the forefront of neural network models used for learning from sequential data. Examples include time series problems and natural language understanding tasks such as machine translation and speech recognition (Cho, 2014; Graves, 2013). What makes RNNs powerful in dealing with sequential data is their stateful design: RNNs have a number of internal states that are updated as consecutive elements of a sequence are processed. These internal states are then used, along with the current input, to predict sequences of outputs. This gives rise to a model whose individual predictions are influenced not only by the current observation but also by the sequence of prior observations.
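The stateful update can be written down in a few lines. The sketch below is a plain (non-LSTM) recurrent cell in pure Python, computing h_t = tanh(W_x x_t + W_h h_{t-1} + b); the weights are arbitrary toy values, not a trained model.

```python
import math


def rnn_step(x, h_prev, W_x, W_h, b):
    """One recurrent update: h_t = tanh(W_x x_t + W_h h_{t-1} + b)."""
    h = []
    for j in range(len(b)):
        total = b[j]
        total += sum(W_x[j][k] * x[k] for k in range(len(x)))
        total += sum(W_h[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h


def run_rnn(sequence, W_x, W_h, b):
    """Process a sequence, returning the hidden state after every element."""
    h = [0.0] * len(b)
    states = []
    for x in sequence:
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states


# Toy weights: 1 input feature, 2 hidden units (values are arbitrary).
W_x = [[0.5], [-0.5]]
W_h = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]

# Two sequences that end with the same input but differ in their history.
state_a = run_rnn([[1.0], [0.0]], W_x, W_h, b)[-1]
state_b = run_rnn([[-1.0], [0.0]], W_x, W_h, b)[-1]
print(state_a != state_b)
```

The two final states differ even though the last input is identical, which is exactly the dependence on prior observations described above.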

RNNs come in different flavors that generally differ in the details of the internal computational steps that connect their inputs and outputs. In our case, since sequences of member events can be quite long, we used LSTM (long short-term memory) networks, which are designed to handle long-term dependencies (Colah, 2015).
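The standard LSTM update adds a separate cell state that is carried forward through gated, additive steps. The sketch below implements one such step in pure Python; the `params` layout and the hand-picked toy biases are assumptions for illustration, not our trained network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Compute W x + U h + b for small dense matrices stored as lists."""
    return [
        sum(w * xi for w, xi in zip(W[j], x))
        + sum(u * hi for u, hi in zip(U[j], h))
        + b[j]
        for j in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps each gate name to a (W, U, b) tuple."""
    i = [sigmoid(z) for z in affine(*params["input"], x, h_prev)]
    f = [sigmoid(z) for z in affine(*params["forget"], x, h_prev)]
    o = [sigmoid(z) for z in affine(*params["output"], x, h_prev)]
    g = [math.tanh(z) for z in affine(*params["candidate"], x, h_prev)]
    # The cell state is carried forward additively, gated by f and i; this
    # path is what lets information persist across many steps.
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


# One input feature, one hidden unit. Biases are set by hand so that the
# forget gate stays open and the input gate stays closed.
zero = [[0.0]]
params = {
    "input": (zero, zero, [-10.0]),
    "forget": (zero, zero, [10.0]),
    "output": (zero, zero, [0.0]),
    "candidate": (zero, zero, [0.0]),
}
h, c = [0.0], [1.0]
for _ in range(50):
    h, c = lstm_step([0.3], h, c, params)
print(c)
```

With these gate settings the cell state is still close to its initial value of 1.0 after 50 steps, which illustrates how an LSTM can hold on to an early event in a long sequence.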

This model is currently used for the following applications:

Identifying High-Cost Claimant Members

One of our mandates at Accolade is to help our customers manage the healthcare spending of their employees. Employers often incur inflated medical costs owing to employees who are heavy users, usually because they make frequent visits to healthcare providers and/or have expensive medical claims. Identifying those people enables our health assistants to engage with them early on to provide guidance, ensure they use their healthcare and benefits properly, and inform them about alternative options available to them through their health plan.

We use RNNs on sequences of our members’ historic claims to predict whether a given member is likely to become a high-cost claimant in a certain time period, for example by the end of the calendar year. The resulting model is periodically applied on existing medical claims data of individual members to give the probability for a member becoming a high-cost claimant later on in the year. This enables Accolade to identify future high-cost claimants and reach out to them before they actually incur such increased costs.

Forecasting Member Interactions

Calls and/or direct messages are another type of event making up sequences of longitudinal health data of Accolade members. These interactions are two of the primary methods of communication with our members. As described earlier, interactions with Accolade are interrelated with claim events. For example, members contact Accolade to inquire about their past or upcoming medical claims.

We train an RNN-driven model on sequences of member claims and call events, in order to predict the probability that a member will contact us in any given time period. If more members are predicted to have higher likelihood of calling Accolade, bigger call volumes can be expected. Anticipating this volume enables us to be proactive about members’ healthcare and benefit needs and plan accordingly for our own staffing requirements.
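The step from per-member probabilities to an anticipated volume is simple arithmetic: the expected number of contacting members is the sum of the individual probabilities. The sketch below also reports a standard deviation under the simplifying assumption that members act independently; the probabilities are made-up values, not model output.

```python
import math


def expected_call_volume(contact_probs):
    """Expected number of contacting members and its standard deviation.

    Treats each member's contact as an independent Bernoulli event, which
    is a simplification made for this sketch.
    """
    mean = sum(contact_probs)
    variance = sum(p * (1.0 - p) for p in contact_probs)
    return mean, math.sqrt(variance)


# Hypothetical per-member contact probabilities for one time period.
probs = [0.9, 0.5, 0.5, 0.1]
mean, std = expected_call_volume(probs)
print(round(mean, 2), round(std, 2))
```

Summing over the whole member population for each upcoming period gives a forecast curve that staffing plans can be checked against.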

Works Cited:

Cho, K., et al. (2014). Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP (pp. 1724-1734). Doha: Association for Computational Linguistics.
Colah, C. (2015). Understanding LSTM Networks. Retrieved from http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cruz-Roa, A., et al. (2014). Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks. SPIE Medical Imaging, 904103–904103.
Erhan, D., et al. (2010). Why Does Unsupervised Pre-training Help Deep Learning? JMLR, 625-660.
Graves, A., et al. (2013). Speech recognition with deep recurrent neural networks. International Conference on Acoustics, Speech and Signal Processing (pp. 26-31). Vancouver, BC: IEEE.
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. JMLR, 448-456.
Poplin, R., et al. (2018). Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nature Biomedical Engineering, 158–164.
Srivastava, N., et al. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. JMLR, 1929-1958.

My Idea for Bringing Artificial Intelligence (AI) to Airports That Someone Should Go Execute

Inspection of carry-on bag scans for security at airports is currently manual. This results in long wait lines, frustrated travelers and susceptibility to human error. At present, Computed Tomography (CT) is used to scan carry-on bags, and a security officer then views each scan and decides whether the carry-on contains any unsafe items. Manually reviewing the scanned images takes time and, because the process is manual, the chance of human error is high. The analysis of scanned images is also not uniform, because decision making varies from one security officer to another.

A New Security Screening Solution

So how about we try something new? We could incorporate an Artificial Intelligence (AI) based solution in current CT scan machines at airports, which could automatically distinguish between safe and unsafe items from the scanned images of carry-on bags at airports.

The major advantages of this system:

Shorter wait lines and less frustrated travelers. AI based automated recognition of unsafe carry-on bags will make the carry-on security check process faster and help reduce traveler frustration. And when passengers are not waiting in lines, they might be at restaurants or shops at the airport, which could add to revenue generation.
Higher accuracy and better security because of reduced human error. A high level of accuracy will be possible through the use of machine learning algorithms as well as the regular upgrade of the classifiers used by those algorithms as labeled data becomes available for re-training them. Once the algorithms are re-trained, their deployment on the CT scan machines in the field can be done cheaply and quickly without disrupting the field operations in a significant way.

So now you might be thinking: that sounds like a natural progression, hasn’t anyone already researched this field? Research has indeed been done. The U.S. Department of Homeland Security (DHS) has teamed up with Google and its crowdsourcing site, Kaggle, to search for new algorithms to identify concealed objects detected by airport security body scanners. The problem that DHS and Google are trying to solve is not exactly the same as the one described in this blog, but it is a similar concept: using AI algorithms to detect concealed dangerous objects on the human body rather than in carry-on bags.

Some CT scan machine manufacturers have introduced 3D CT scanners that create 3D images of carry-on bags and help security officers inspect the scanned images in three dimensions. This gives security officers the ability to analyze each scanned image from various angles and make a better judgement of the security level of an object in the carry-on. Though this idea is not based on AI algorithms, it tries to solve the same problem of speeding up carry-on bag security checks at airports.

Creating the Solution

The idea is to have an artificial intelligence-based security bag-check solution that integrates with existing computed tomography (CT) scanning systems at airports. CT scans of carry-on bags at airport security yield a lot of information that security officers have to individually analyze and evaluate comprehensively in a short time. The proposed solution will use AI to help security officers make better and faster decisions by performing automated image recognition of dangerous objects in carry-on bags. It will use deep learning technology, which combines neural network architectures, to automatically identify visual differences between safe and unsafe/abnormal items in scanned images. Figure 1 shows a very high-level pictorial representation of the AI solution described here, while Figure 2 and Figure 3 describe the AI algorithm in more detail.

Figure 1: Pictorial representation of the AI solution described in the blog

The novelty of this idea lies in an AI solution that performs inference on the edge (the CT scanners), with training happening in the cloud. The AI solution consists of 3 major steps:

Initialization
Deployment
Re-training (happening as needed)

a. Initialization

The first step of initialization would consist of training an appropriate deep learning model before it can be deployed for use. One of the biggest challenges in this process would be to acquire enough data to train the model. The performance of neural networks (i.e. deep learning) depends on the number of input images you use to train them. For a more generic problem such as recognizing a cat in a scene, the training samples may not have to be huge to provide a reasonable level of accuracy. However, given the diverse nature of threatening items in a bag (e.g. guns, explosives, liquids, drugs, live animals, etc.) and the infinite ways in which these can be positioned in a bag, the required number of training samples will be pretty high. This challenge of training data collection can be addressed in three ways:

Collaboration with Department of Homeland Security (DHS) to make CT scan images available for training. This would be similar to what DHS is working on with Google for body scanners as mentioned above.
Perform continuous data collection at airports to create a large, secure pool of data to do training. Also while this data is being collected, the decisions made by the human operators after examining each bag can be used as labeling data, which establishes the “ground truth” information that makes the collected data suitable for training the models. This approach also alleviates the need to finance the offline labeling of the collected data, which can be very expensive. Because CT scan machines used at airports are pretty much standard devices, training data and trained models used on one device can also be used on others reliably. This also ensures a high level of standardized security checks independent of the experience of the security officers based at various airports around the globe.
Deep learning algorithms generally have feature extraction and classification as two main steps. Apart from getting real data from CT scanners at the airport, the CT scan manufacturers can work to automatically generate multiple view angles of threatening objects (e.g. guns). This will help create a subset of features for known unsafe objects, which can be fed to the training model.

Figure 2: Initialization phase: Flowchart highlighting the data collection and labeling process performed by the human inspectors, with initially no AI in the loop.

Figure 2 shows a flowchart of how the AI model will be trained on safe and unsafe objects during the initialization phase. In this phase, human inspection would happen as usual; however, the human inspector would also label the scanned objects, and these labels will be used for model training.

b. Deployment

In the second step, the trained model will be deployed in the CT scanners at the airports, assisting security officers in identifying unsafe objects in the scanned images of carry-on bags. Depending on the CT scanner model, the deployment could potentially be just an upgrade to the current software stack of the CT scanners, with the AI algorithm added to it. Once the AI algorithm has been deployed in the CT scanners, it will infer the safety level of the scanned images of carry-on bags.

Figure 3: Flowchart showing invention algorithm in deployment phase

As shown in Figure 3, once the trained model has been deployed, each inference of a bag’s safety level can result in one of multiple states, depending on the inferencing accuracy. If the AI model’s safety accuracy (in effect, its confidence that the bag is safe) is higher than the defined criteria, the bag will pass the security check and move ahead to be picked up by the traveler. However, if the accuracy is below the required criteria, human inspection would be needed. Depending on the safety of the bag, the human inspector will mark the bag as safe or unsafe, and this information will be sent to the cloud to be stored as training information.

In cases where the AI model detects a bag as unsafe, human inspection will be required. The human inspector will then decide the safety level of the bag and send the labeled information on the scanned image of the bag to the cloud to be stored as training data.
In the description of this solution, a “Negative” means the bag is safe and the input image does not have any identified unsafe objects in it.

The four states mentioned in Figure 3 can be described as follows:

True Negative: This will happen when the AI model accuracy is less than the set criteria, it infers that the bag is safe, and the security officer finds that to be true.
False Negative: This will happen when the AI model accuracy is less than the set criteria and it infers that the bag is safe, while on cross-checking the security officer finds unsafe objects in the bag.
True Positive: When the AI model infers the input image to be unsafe and security officer agrees with that output.
False Positive: When the AI model infers the input image to be unsafe and security officer finds that to be untrue.
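
The routing and the four states above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0.95 threshold, the `Inference` record and every function name are hypothetical, and the blog’s “accuracy” is read here as the model’s per-bag confidence.

```python
from dataclasses import dataclass

# Hypothetical value standing in for the "defined criteria" in the text.
CONFIDENCE_THRESHOLD = 0.95


@dataclass
class Inference:
    predicted_unsafe: bool  # True = "Positive" (model flags the bag as unsafe)
    confidence: float       # model's confidence in its own call, 0.0 to 1.0


def needs_human_inspection(inf, threshold=CONFIDENCE_THRESHOLD):
    """Unsafe calls always go to an officer; safe calls go to an officer
    only when the model's confidence is below the threshold."""
    return inf.predicted_unsafe or inf.confidence < threshold


def outcome_state(inf, officer_found_unsafe):
    """Name the state once the officer's ground-truth decision is known."""
    if inf.predicted_unsafe:
        return "True Positive" if officer_found_unsafe else "False Positive"
    return "False Negative" if officer_found_unsafe else "True Negative"


def training_record(scan_id, inf, officer_found_unsafe):
    """Bundle what would be sent to the cloud for the next training batch."""
    return {
        "scan_id": scan_id,
        "model_unsafe": inf.predicted_unsafe,
        "confidence": inf.confidence,
        "ground_truth_unsafe": officer_found_unsafe,
        "state": outcome_state(inf, officer_found_unsafe),
    }
```

For example, a bag inferred safe with confidence 0.80 would be routed to an officer, and if the officer then finds an unsafe object, the stored record carries the state “False Negative”.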

The solution seeks to support security officers, not replace them. The security officers will always be responsible for the final interpretation of the scanned images. In all states of the AI algorithm, the final decision made by the security officer (ground truth), along with the raw input image and the labeled data, is sent to the cloud database for the next batch of training data. In the case of a false negative or false positive, the security officer will be expected to correctly label the specific objects that the AI model inferred incorrectly. This will help the system continuously improve its decision making, as data collection and labeling will be performed continuously once the intelligent CT scan machines are deployed in the field. Security officers are ultimately responsible for accepting or rejecting the image tags. The system uses this feedback to improve its accuracy and robustness as it encounters more examples.

c. Re-training

Once the CT scanner with the AI algorithm has been deployed and is being used, re-training of the model will need to happen continuously in the cloud. The labeled input and output data from the deployment stage will be collected and stored on the edge device (CT scanner) and sent to the cloud periodically for re-training the model, as shown in Figure 3. The updated re-trained model, or the new classifier, will be downloaded to the edge device on a regular basis to improve its level of inferencing accuracy.
After the AI algorithm at the edge device has performed at the expected accuracy for a prolonged period of time, human inspection for the “negative” state of bags can be progressively removed, with only random cross-checks of the results. The re-training frequency of the model can also be adjusted as needed.
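
As a rough sketch of how that progressive hand-off might be gated, the snippet below checks for sustained accuracy and then samples auto-cleared bags for random cross-checks. The function names, the daily-accuracy list, and the idea of a fixed audit rate are all assumptions made for illustration; the post does not prescribe a specific policy.

```python
import random


def can_reduce_inspection(daily_accuracy, target, min_days):
    """True only if the most recent `min_days` daily accuracy values all
    meet the target, i.e. the model has performed well for a sustained
    period rather than on a single good day."""
    if len(daily_accuracy) < min_days:
        return False
    return all(a >= target for a in daily_accuracy[-min_days:])


def select_for_audit(scan_ids, audit_rate, seed=None):
    """Pick a random subset of auto-cleared ("negative") bags for a human
    cross-check. Each bag is selected independently with probability
    `audit_rate`."""
    rng = random.Random(seed)
    return [s for s in scan_ids if rng.random() < audit_rate]
```

An audit rate of 1.0 reproduces full human inspection, so the rate can be lowered gradually as confidence in the deployed model grows.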

Conclusion

The pieces described in this solution like image recognition, deep learning neural networks, and training and inferencing models for imaging, already exist in the ecosystem currently. The novelty of this idea will be that it will have an AI solution that integrates with current CT scanners at airports and helps automatically detect unsafe objects in scanned images of carry-on bags at airports as well as continuously learning to improve its accuracy. The author encourages the readers to further explore the applicability and benefits of this solution.

Use Transfer Learning for Efficient Deep Learning Training

One of the biggest challenges with deep learning is the large number of labeled data points that are required to train deep learning models to sufficient accuracy. For example, the ImageNet*¹ database for image recognition consists of over 14 million hand-labeled images. While the number of possible applications of deep learning systems in vision tasks, text processing, speech-to-text translation and many other domains is enormous, very few potential users of deep learning systems have sufficient training data to create models from scratch. A common concern among teams considering the use of deep learning to solve business problems is the need for training data: “Doesn’t deep learning need millions of samples and months of training to get good results?” One powerful solution is transfer learning, in which part of an existing deep learning model is re-optimized on a small data set to solve a related, but new, problem. In fact, one of the great attractions of transfer learning is that, unlike most traditional approaches to machine learning, we can take models trained on one (perhaps very large) dataset and modify them quickly and easily to work well on a new problem (where perhaps we have only a very small dataset). Transfer learning methods are not only parsimonious in their training data requirements, but they run efficiently on the same CPU-based systems that are widely used for other analytics workloads, including machine learning and deep learning inference.

Transfer Learning Basics

The idea of transfer learning is inspired by the fact that people can intelligently apply knowledge learned previously to solve new problems. For example, learning to play one instrument can facilitate faster learning of another instrument. Another good analogy is with traditional software development: We almost never write a program completely from scratch; every application makes heavy use of code libraries that take care of common functionality. Maximizing code reuse is a best practice for software development, and transfer learning is essentially the machine learning equivalent.

As described in the article “Start your data analytics journey today with open source and transfer learning”,² transfer learning is an artificial intelligence (AI) practice that uses data, deep learning recipes, and models developed for one task, and reapplies them to a different, but similar, task. In other words, it’s a method in machine learning where a model developed for one task is used as a starting point for a model in a second task. Reuse of pre-trained models allows improved performance when modeling the second task, hence achieving results faster.

Vast quantities of readily available data are great, but they aren’t a prerequisite for success. With modern machine learning and deep learning techniques, knowledge acquired by a machine working on one task can be transferred to a new task if the two are somewhat related. This helps to reduce training time significantly, thus improving the productivity of data scientists. There are three main questions to consider when trying to implement transfer learning: what to transfer, how to transfer and when to transfer. What to transfer asks which part of the knowledge can be transferred across domains. Once what to transfer has been addressed, the next step is to develop algorithms and models to transfer the knowledge; this falls under the how to transfer step. The last question asks in which cases the knowledge should be transferred and, more importantly, in which cases it should not. In certain situations, brute-force transfer may even hurt performance on the target task, an effect often referred to as negative transfer.

Now that we understand some basics of transfer learning, let’s look at a few use cases. But before we do that, let me remind you that we do not always need enormous amounts of data for end-to-end deep learning training. With use cases like facial recognition, where the fundamental features used for classification don’t change, there is no need to retrain the complete deep neural network. Transfer learning can be employed in such scenarios, where the features learned using a large dataset are transferred to the new network and only the classifier part is trained with the new, much smaller dataset, as shown in Figure 1.
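
To make the “freeze the features, train only the classifier” idea concrete, here is a deliberately tiny, dependency-free sketch. It is not how a real deep network would be fine-tuned: the frozen weights in `FROZEN_W` are made-up numbers standing in for layers learned on a large dataset, the four training points are toy data, and the new classifier is a plain logistic regression.

```python
import math

# Stand-in for feature layers learned on a large source dataset.
# Hypothetical values; they are never updated during training below.
FROZEN_W = [[1.0, -1.0], [0.5, 0.5]]


def extract_features(x):
    """Frozen 'pre-trained' layer: a linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit only the new classifier (logistic regression) on top of the
    frozen features, using plain stochastic gradient descent."""
    feats = [extract_features(x) for x in samples]
    w = [0.0] * len(feats[0])
    b = 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            w = [wi - lr * err * fi for wi, fi in zip(w, f)]
            b -= lr * err
    return w, b


def predict(x, w, b):
    """Probability of class 1 for a new input."""
    f = extract_features(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


# A "much smaller" target dataset: four labeled toy examples.
small_x = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
small_y = [1, 1, 0, 0]
head_w, head_b = train_head(small_x, small_y)
```

Only `head_w` and `head_b` are learned; the feature extractor is reused as-is, which is why so few labeled examples can be enough when the transferred features already separate the classes.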

Transfer Learning in Practice

Transfer learning has been applied in numerous places to solve a variety of real-world problems. One such unique example is the use of transfer learning to find missing children. More than 465,676 missing children were reported to the Federal Bureau of Investigation in 2016 alone¹². More than 100,000 escort advertisements are posted online every day, and one in six children reported missing is a possible victim of sex trafficking, as reported by the National Center for Missing and Exploited Children. Intel has worked with Thorn⁴ to address the challenge of matching the images of children in the online escort ads with the pictures of known missing children. Thorn is an organization that leverages technology to fight child sex trafficking, and it applied transfer learning to tackle its huge data challenge.¹,³ Intel helped Thorn take open source models trained on general images of adults and reuse the system to recognize and match images of trafficking victims. To further improve Thorn’s ability to find trafficking victims, Intel used transfer learning on Intel Xeon processors to retrain the model. Using a small dataset of a thousand victims, they took what the algorithm could already do, match general images of adults, and repurposed it for the new problem.

Another area where transfer learning is gaining popularity is medical image analysis. There have been several research publications⁵,⁶,⁷ in the past few years on how transfer learning is being employed to detect diseases in medical images with a high level of accuracy. A few researchers have demonstrated the application of transfer learning to detect eye diseases in medical images, with accuracy comparable to human experts. They have applied transfer learning to accelerate the diagnosis of age-related macular degeneration (AMD) in medical images of the eye. What’s remarkable about this, and other applications of transfer learning in medical image analysis, is that it can expedite the diagnosis and referral of treatable medical conditions, resulting in early treatment and improved clinical outcomes.

There is no dearth of use cases where transfer learning can be applied with a high level of resultant accuracy. Apart from the application of transfer learning in medical image analysis, there have been studies around facial verification¹⁰, sentiment analysis⁸ as well as in mispronunciation detection⁹. The fact that transfer learning doesn’t need a huge amount of data for training and can repurpose extracted features from a different source task makes it a great technique to apply and use in a variety of domains.

Getting Started on AI with Transfer Learning

AI has the potential to revolutionize the world. There has been significant growth in the number of application domains where AI can be applied. There are a variety of ways to get started on AI, and this blog describes one of them, transfer learning. Transfer learning eliminates the need for specialized hardware and large datasets, making it easier and faster for users to deploy AI workloads. By using transfer learning, developers can use their current infrastructure with a limited amount of data and start their AI journey today. We expect transfer learning to be applicable to various domains where the learned features do not change (thanks to the rules of nature) and can be reused across domains and problems. In the future, transfer learning techniques will potentially be applied to video classification, social network analysis, and logical inference.

We encourage readers to further explore the applicability and benefits of transfer learning. Those interested in learning more can check out our comprehensive whitepaper on the topic, here¹¹.

References:

Thinking Outside the Black Box: Why Transparency in AI is Just the Beginning

As algorithms increasingly make decisions about our lives and others’, the world is pushing for greater transparency into how they work. But while transparency is a necessary starting point, it’s not the end goal. You might have a perfectly transparent system – but that doesn’t mean that it’s an unbiased one.

Holding AI or ML accountable is a multifaceted endeavor. Doing so requires an understanding of the “why” behind their decision making. We want to know why we were rejected for a bank loan, or why apparently objective machines are participating in racial profiling.

But being able to see and understand the “why” doesn’t mean that “why” is without problems. Take our bank loan example. A system might be transparent in that it lists thousands of factors influencing its decision making, none of which are outwardly problematic. But seemingly innocuous factors such as zip code or a parent’s profession could be functioning as proxies for race.

Transparency helps us flag bias – but it isn’t enough to counteract it. So what can we do to ensure that our AIs are acting fairly and equitably?

The evil is in the data (or lack thereof)

Where algorithms go bad is in large part in the information they’re trained on. Those datasets hold all the biases, prejudices and blind spots that come out to play when they’re fed into an algorithm. If the data is racist or sexist, those issues will show up even if you exclude racial and gendered variables. Your AI will just use proxy measures instead.
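
A toy illustration of that proxy effect: the records below are synthetic and hand-built purely for this example, and the “model” is nothing more than a rule learned from zip code alone. The group label is never shown to the rule, yet the outcome still splits along group lines because zip code correlates with group.

```python
from collections import defaultdict

# Synthetic, hand-built records: (group, zip_code, historically_approved).
# "group" is the protected attribute; zip code merely correlates with it.
records = (
    [("A", "11111", True)] * 8
    + [("A", "11111", False)] * 1
    + [("A", "22222", False)] * 1
    + [("B", "22222", True)] * 2
    + [("B", "22222", False)] * 7
    + [("B", "11111", True)] * 1
)


def fit_zip_rule(rows):
    """'Train' on zip code only: approve a zip code if at least half of its
    historical applicants were approved. The group is ignored here."""
    counts = defaultdict(lambda: [0, 0])  # zip -> [approved, total]
    for _group, zip_code, approved in rows:
        counts[zip_code][0] += int(approved)
        counts[zip_code][1] += 1
    return {z: a / n >= 0.5 for z, (a, n) in counts.items()}


def approval_rate_by_group(rows, rule):
    """Apply the zip-only rule and measure the approval rate per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [approved, total]
    for group, zip_code, _approved in rows:
        totals[group][0] += int(rule[zip_code])
        totals[group][1] += 1
    return {g: a / n for g, (a, n) in totals.items()}


rule = fit_zip_rule(records)
rates = approval_rate_by_group(records, rule)
```

In this made-up dataset the rule approves 90% of group A and 10% of group B, even though group membership was excluded from what it learned from.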

To minimize bias, it’s essential to deal with these issues as close to the source as possible. That means holding ourselves – both as companies and citizens – accountable for how we collect, clean and treat our data.

Biased data is absolutely an issue in AI and ML. But it’s not the only issue that arises when dealing with large datasets. Others include the context around how our datasets are built and how the decisions of an algorithm are ultimately applied. Take the college admissions process as an example. Factors such as test score, grades and extracurriculars can be weighted as a way of highlighting potential candidates. But what about prospective students whose achievements don’t show up on those measures? How can the system identify and recognize their accomplishments?

Machines shouldn’t be making decisions that can have a major impact on people’s lives. They can’t possibly know that they’re not getting the whole picture. They’re just machines. They lack a true understanding of ethics, compassion and plain old common sense.

What we need is to admit fallibility

The solution would seem to be that humans should be the ones making those decisions – after all, we have the benefit of context, empathy and experience.

But humans also suffer from, well, being human. We all have our own internal biases and prejudices, and we also suffer from things like cognitive overload and changeable moods. Judges, for example, are more likely to be lenient early in the day or after a lunch break. We’re the reason our data is biased in the first place, and relying on humans to counteract the prejudice put into the system by humans has limited value.

The more we hear about AI “fails,” the more it becomes clear that despite our best efforts the correct decision in a situation isn’t always obvious, and that bias is hard to detect. What we need isn’t just transparency, but systems that acknowledge their own fallibility – and seek to correct it.

Algorithms may be biased, but that’s because we are. Those issues are compounded by the fact that another part of being human is that people make mistakes. We might misunderstand what an algorithm is telling us, or find bias popping up in an unexpected place. We might find that our business goals don’t map clearly to variables that can be optimized. Or we might be misled into thinking that a transparent system is necessarily a fair one.

Yes, we should be demanding transparency around the data that AI and ML learn from. But we should also be demanding it around the laws, processes and procedures that exist around a workflow where algorithms are present. Checks and balances are needed to reduce bias and error in both our data and in the way we understand and apply the results of our algorithms in the real world.

After all, transparency is meaningless unless paired with an effort to identify, mitigate and act upon the issues it spins up. Transparency is just a small part of a much larger problem.

Genomic Precision Medicine as a Small Data Machine Learning Problem

What is this blog about?

If you don’t want to read the whole thing, here is the gist:

Precision Medicine (PM) is, to a large extent, a well-established idea of selecting medical treatments based on the patient’s genetics
Major (though not all) genomic PM applications can be formulated as Machine Learning (ML) problems
These problems are commonly considered to be Big Data problems
I will argue that they are in fact Small Data problems
This has major implications for methods and software used to develop the PM applications and services for clinical use
Smallness of data for PM is a fundamental problem, not a technological obstacle

What is Precision Medicine?

PM helps select medical treatments on the basis of certain special characteristics of the patient. Of course, medicine has always considered personal characteristics of a patient before determining treatment; the novelty of PM is that it uses patient characteristics, such as genome data, which have only become available in modern times. A typical scenario is this: you see a doctor for a problem, and she orders a test which “measures” (reads) your genome. The readout is then analyzed by an algorithm, which produces a test report, which may say something like: given your genomic profile, you are likely to benefit from Drug D (Fig. 1). Based on this and other information, and on their own expertise, the physician prescribes a treatment. The idea is that the physician can make a better, more informed choice with the help of the test report. There are lots of variations on this theme, but this is the essence.

Fig. 1. Genome-based personalized medicine.

For completeness: PM algorithms may consider types of information other than genomic data. Two prominent examples are: 1) imaging (X-rays, CT scans, retinal scans, etc.) 2) Electronic Health Records (EHR). For this blog, I will focus on genomic data, for these reasons:

As of 2018, it is still the principal approach to PM
It is my area of expertise
Imaging and EHR-based PM are substantially different problems. The arguments I put forward for genomic PM may or may not extend to these newer areas

Precision Medicine as Machine Learning Problem

In principle, it is clear how PM can be approached as an ML problem. Consider patients suffering from some disease, and split them into two groups: those who have benefited from Drug D, and those who have not (how to define “benefit” is a separate and complicated question… let’s just say there is a binary indicator of benefit called Y). Each patient also has a genomic measurement X of some sort, which can be expressed as a fixed-length numeric vector. Given this setup, one could apply ML to develop a classifier which assigns the patients to the two groups, responders (those who benefit) and non-responders (those who don’t), using X as predictor variables. Once the classifier is tuned to sufficient accuracy, it can be used to estimate the probability of benefit for future patients, based on their genomic readout. The question for our discussion is whether the problem as described is “Big Data” or not.
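
As a minimal sketch of this setup, the snippet below estimates the probability of benefit for a new patient from labeled (X, Y) pairs. The choice of a k-nearest-neighbors estimate is mine, made only because it fits in a few lines; the post does not name a particular classifier, and the two-feature genomic vectors are invented toy values.

```python
import math


def prob_benefit(x_new, X_train, y_train, k=3):
    """Estimate P(Y = 1 | x_new) as the fraction of responders among the
    k training patients whose genomic vectors are closest to x_new
    (Euclidean distance)."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(x_new, X_train[i]))
    nearest = order[:k]
    return sum(y_train[i] for i in nearest) / k


# Toy training set: each row is a fixed-length genomic vector X,
# each label is Y (1 = responder, 0 = non-responder). Values are made up.
X_train = [[0.10, 0.20], [0.20, 0.10], [0.15, 0.30],
           [0.90, 0.80], [0.80, 0.90], [0.85, 0.70]]
y_train = [0, 0, 0, 1, 1, 1]

p_new = prob_benefit([0.80, 0.80], X_train, y_train)
```

Note that every training row needs both X and Y; a genome without a recorded outcome contributes nothing to this estimate.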

For completeness, a note: there is also a PM approach where X is a scalar, not vector. In other words, you measure a single genetic characteristic – for example, mutation or other alteration in a single gene – and derive treatment guidance based on presence or absence of that feature. This method is particularly widespread in oncology, and does not involve ML since it is typically based on biology, i.e., learning from “first principles”. It also has severe limitations, which I touch upon in the last section. In any case, we do not consider that case; in our setup X has measurements on multiple genes or other genomic characteristics.

Precision Medicine as Big Data Problem

First, a clarification: I focus on sample size in this post, not number of features. Thus, by Big Data, I mean lots of samples, and conversely, by Small Data, I mean few samples. The reason is that modern machine learning has effective methods of analyzing data with many features, whereas that is not the case for dealing with small sample sizes. In other words, it is sample size which really matters most, and that’s what I discuss here.

With this background, I’ll argue that the type of PM defined earlier is generally considered to be a Big Data problem. The reason for this is that in recent years there has been an extraordinary exponential growth of the genome (“sequencing”) information generated. Talks in this field almost universally display a graph of number of genomic sequences or nucleotide bases in public databanks as a function of time, pointing to exponential growth. A typical figure is shown below.

Fig. 2. Cumulative number of human genomes sequenced worldwide. From Stephens et al., Big Data: Astronomical or Genomical. PLOS Biology, 2015 (<https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195>)

The immense numbers and growth of the reported sequencing data strongly suggest that any ML application purporting to produce PM algorithms (classifiers) must be dealing with Big Data. If true, that would be a good thing, because given that much data we ought to be able to develop very accurate classifiers, and thereby help many patients.

However, the argument that PM is a Big Data problem has a flaw. First, consider some fairly obvious background.
To derive a genomic classifier for a typical PM application, one has to train ML algorithms using predictor variables X (i.e., the genome data) and a dependent variable Y (i.e., “phenotypic” data). In other words, for each “training set” patient there must be both X and Y available. The Y data depends on the disease of interest. Some types of binary Y information are: who will experience a particular disease outcome (for example, worsening of disease, cancer recurrence, death) in a given timeframe (this is called prognosis); which patients with a particular disease benefit from a given drug (this is called prediction; for example antibiotic resistance); and which patients have a particular disease or disease type (for example, bacterial vs. viral infection), given a genomic signature found in their body (blood, tumor tissue, nasal swab etc.; this is called diagnosis).

Given this background, the key problem with the Big Data argument is that the growth of Y data lags far behind that of the genomic (X) data. You will seldom, if ever, see a chart similar to the growth of sequencing data that illustrates the growth of the phenotypic data. And yet, it is Y that determines what data is available for training PM algorithms, because, as we’ve seen, we must have X and Y for each patient. In fact, my experience and a large body of published literature show that the amount of Y data for almost any specific clinical problem of interest is severely limited. There are many reasons, but a critical one is time: it can take a long time to acquire a sufficient amount of Y (dependent variable) data.

Consider, for example, a hugely successful application of PM in guiding the chemotherapy treatment for early-stage breast cancer patients (<https://tinyurl.com/ybfgdaxq>). In this case, a classifier was trained to predict whether the patient’s cancer will recur within 10 years of diagnosis (recurrence). If the cancer is likely to recur, chemotherapy can be given to reduce the chances of that happening. On the other hand, the vast majority of patients do not need the toxic chemo because their cancers are not coming back. Therefore it is of great interest to predict recurrence, and limit chemo to just those patients likely to recur. To train the model to predict recurrence, as usual two sources of data must be available:

The genomic data X, which is more or less readily available in case of breast cancer
The 10-year recurrence information: Y = 1 if cancer recurred, 0 otherwise. Depending on the type of ML model, time of recurrence is also recorded.

In other words, we must wait the full 10 years for each patient to learn their Y variable (and time, if used). In practice, the wait is longer, because we can’t recruit the full cohort of patients on Day 1. Needless to say, this imposes severe limitations on the development of this type of test, because few businesses (and investors) can afford, or are willing, to wait this long for data. In fact, it is remarkable that this particular application has been developed at all. Worst of all, very little can be done to alleviate this problem: no amount of investment or technological progress can fundamentally alter the fact that we have to wait for outcomes to happen.

Given this situation, we see that the sample size for genomic PM is truly driven by the amount of outcome (Y) data, not by X. And given the time it takes to acquire typical outcomes, the samples available to PM algorithm developers commonly number in the hundreds, if not fewer. I experienced this first hand, having spent 15 years developing genomic PM applications and always being short on data, without exception. For this reason, typical PM ML applications are in fact Small Data problems. And for this reason, such applications are not very widespread, as evidenced by the fact that there are only a handful of such products in clinical use.

Precision Medicine as Small Data Problem

Given the above reasoning, we argued that ML-based genomic PM applications are Small Data problems. This brings up two more questions: 1) what exactly is the “number of samples”? and 2) what does “Small” actually mean (i.e., how many samples constitute Small Data)?

Let’s discuss 1) first. This sounds like a very simple question: count the number of patients available in the training/test sets, and there is your sample size. But that ignores the question of prevalence, i.e., unbalanced datasets. In medicine, the sample sizes in the two classes are typically unbalanced, sometimes dramatically. Consider, for example, the STRIVE study by the company GRAIL (<https://grail.com/clinical-studies/>), which aims to recruit 120,000 women for a breast cancer study. This is gigantic by any standard.

But based on breast cancer incidence data, only about 600 (0.5%) of the women in the study will be diagnosed with cancer during the study. Therefore the effective sample size is closer to 1,200 (600 “positive” and 600 “negative”). Of course, you can use all the 100,000+ negatives, but the only gain is a better estimate of specificity (at the cost of slower learning). To fully characterize the test, one also needs sensitivity, which is driven by the number of “positive” patients, and that is just around 600. In other words, the effective sample size is driven by the cardinality of the smallest class, not by the total.
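To make the arithmetic above concrete, here is a minimal Python sketch. The function names are hypothetical, and the 0.9 sensitivity and specificity are assumed values chosen only for illustration, not results from STRIVE. It uses the plain binomial standard error to show why the many negatives mainly sharpen the specificity estimate.

```python
import math

def effective_sample_size(n_positive, n_negative):
    """Balanced-equivalent sample size: twice the size of the smaller class."""
    return 2 * min(n_positive, n_negative)

def binomial_se(p, n):
    """Standard error of a proportion p estimated from n samples."""
    return math.sqrt(p * (1 - p) / n)

n_total = 120_000      # planned enrollment quoted in the text
prevalence = 0.005     # about 0.5% diagnosed during the study
n_pos = round(n_total * prevalence)
n_neg = n_total - n_pos

ess = effective_sample_size(n_pos, n_neg)

# Assumed (hypothetical) test performance, for illustration only.
assumed_sensitivity = 0.9
assumed_specificity = 0.9

se_sens = binomial_se(assumed_sensitivity, n_pos)  # limited by the positives
se_spec = binomial_se(assumed_specificity, n_neg)  # benefits from the negatives

print(f"positives={n_pos}, negatives={n_neg}, effective size={ess}")
print(f"SE(sensitivity)={se_sens:.4f}, SE(specificity)={se_spec:.4f}")
```

Under these assumptions the uncertainty in sensitivity is more than ten times that in specificity, even though both come from the same 120,000-person study.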

Back to the second question: what is the threshold for Big Data? It is, of course, a judgment call, but my view is that a couple of hundred samples per class does not qualify as Big Data. I’ll also say that the small sample sizes are a fundamental limitation due to time-to-outcome and epidemiology (prevalence of diseases), not a technological obstacle.

Methods and Software

Approaching PM as a Small Data problem has significant implications for machine learning methods and software. That could easily fill another blog post, or even a book. In my view, a key implication is that the common split of data into training, validation and test sets is problematic in this context. The reason is that if you have, say, 100 samples per class and split them 60/20/20, then the performance estimates obtained on the 20 validation/test samples are hugely variable and certainly not robust enough to develop diagnostic products for use on patients. Another challenge: if a small validation set is used repeatedly for model tuning, it becomes, in effect, a training set, because the model becomes tuned to the validation set, not to future populations of patients. With thousands or millions of samples per class available, this would basically be a non-issue; but with low hundreds, or fewer, it becomes a major problem.
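How variable is an estimate based on 20 samples? One way to get a feel for it is the Wilson score interval for a proportion. The sketch below is illustrative only: the function name is hypothetical, and the observed sensitivity of 18/20 is an assumed example value rather than a result from any real study.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    center = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

# 20 held-out positives (the 60/20/20 split of 100 samples per class),
# with an assumed 18 of them classified correctly.
low_small, high_small = wilson_interval(18, 20)

# The same observed rate with 2,000 held-out positives, for comparison.
low_large, high_large = wilson_interval(1800, 2000)

print(f"n=20:   {low_small:.2f} to {high_small:.2f}")
print(f"n=2000: {low_large:.2f} to {high_large:.2f}")
```

With 20 samples the interval runs from about 0.70 to 0.97, so an observed sensitivity of 90% is compatible with a test that misses nearly a third of cases. With 2,000 samples the same observed rate is pinned down to within a couple of percentage points.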

My approach is to use elaborate cross-validation techniques, but that is certainly not the only way. I’ll just add that, in general, I have not found adequate support for small sample analyses in leading machine learning frameworks, and have invariably had to develop in-house solutions for my work.
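The specific techniques are not described here, so the following is not the in-house solution mentioned above. It is a minimal sketch of one common option, repeated stratified k-fold cross-validation, written with the Python standard library only; the function name and default parameters are hypothetical.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal indices round-robin so every fold gets a near-equal share of each class.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
            yield train_idx, test_idx

# 100 samples per class, as in the example above.
labels = [1] * 100 + [0] * 100
splits = list(repeated_stratified_kfold(labels))
print(len(splits), "train/test splits")
```

Each sample is used for testing once per repeat, so performance is averaged over many held-out folds rather than read off a single 20-sample test set. If hyperparameters are tuned, that tuning would need its own inner loop inside each training fold (nested cross-validation) to avoid the validation-set leakage described above.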

PAQ (Potentially Asked Questions)

This section is similar to FAQ, except that no one actually asked me these questions, so I can’t call them “Frequently Asked”. But I expect they might be asked.

Isn’t all this obvious? Doesn’t everyone understand the need to have outcome data?

You would think. Despite that, to my knowledge genomic PM has not been commonly characterized as Small Data in publications, talks, blogs, etc. In particular, I haven’t heard much discussion in the public sphere around the key problem of time-to-outcome. Of course, plenty of my colleagues who work in the field are fully aware of these limitations, but I thought it’d be helpful to bring them up in an accessible blog format.

What About Other Applications, Like NIPT and Immunotherapy? Don’t they prove you can practice Precision Medicine without ML?

Yes, PM using genomic data can be, and is being, practiced without ML. NIPT (Non-invasive Prenatal Testing) is a major success story. But that’s an exception; otherwise, the record is mixed at best.

For example, in cancer care the approach is this: consider drugs which target particular mutations. Since we know the target, we only give the drugs to people with those genomic variations. Hence there is no need for any ML, and the Big Data/Small Data argument is moot.

In principle, this is a very elegant idea because it is much simpler than ML. But so far, the record isn’t stellar (see for example https://jamanetwork.com/journals/jamaoncology/article-abstract/2713848 and the excellent podcast https://itunes.apple.com/us/podcast/plenary-session/id1429998903?mt=2, Episode 1.20). I don’t find that surprising, because it seems implausible that the treatment response of a human being, composed of trillions of cells and countless interactions, would be accurately predicted by a single genomic feature.

Bottom line: single-gene methods work for NIPT, but otherwise only to a very limited extent. For truly precise targeting, it seems necessary to learn from data, at least until we master biology well enough to accurately predict outcomes from first principles. Not an easy trade-off, but that’s where we are.

What about imaging? Isn’t there plenty of phenotype data there?

Possibly. I focused on genomics in this blog, but a couple of points regarding imaging: there are indeed plenty of annotated images available in public databases, various ML challenges and elsewhere. The crucial question is what is considered an “outcome”. If an outcome is an expert-annotated feature of an image (for example, “opacity” in X-ray images), then there is a large amount of data, and it is available very soon after the image is taken. In that case the key argument (that it takes a long time to acquire phenotypic data for PM) does not apply. Hence, those applications may indeed take advantage of Big Data (and some have; we should see in the near future whether they are adopted by physicians). However, if an outcome is a patient-centered feature, like the onset of symptoms or a clinical event (stroke, cancer, confirmed diagnosis of Alzheimer’s, etc.), then the phenotype takes a long time to observe, and the available numbers become dramatically smaller. Bottom line: I don’t know if there is a clear answer on this, but we may know more relatively soon.

Isn’t there some health data available for people whose genomes are in public databanks? That’s a lot of people – sounds like Big Data

There is indeed some phenotype data for almost all samples in the large databases. But to develop a clinical PM test, you need exact outcomes for a particular condition, not generic health data. For example, for those working on cardiovascular diseases, the large collection of tumor genomes+diagnoses in TCGA (The Cancer Genome Atlas) isn’t too helpful, and vice versa. The numbers rapidly become small as soon as you focus on a specific disease.

What about EHRs? Isn’t that truly Big Data PM, with many patients?

Quite possibly in the near to medium future, but at present it seems too early to tell: “Can you really tell how good a new medicine is from Flatiron’s data? Even Nat and Zach admit that the jury is still out.” (https://www.forbes.com/sites/matthewherper/2018/11/14/at-24-two-entrepreneurs-took-on-cancer-at-32-theyre-worth-hundreds-of-millions/#383c9a42d689)

Why are people collecting all these genomes? Isn’t that a waste of time, if there are no phenotypes?

Certainly not a waste of time: these genomes are crucial for research in biology, drug discovery, genome association studies, etc. But most of the genomic Big Data, I am arguing, isn’t directly applicable to genome-driven patient care (i.e., PM) due to the lack of usable outcomes.

How about unsupervised learning? It doesn’t need phenotypes

That is very far from clinical applications. Who knows what will happen in the far future, but for the time being, there is no unsupervised PM.
