Genomic Precision Medicine as a Small Data Machine Learning Problem

What is this blog about?
If you don’t want to read the whole thing, here is the gist:

  • Precision Medicine (PM) is, to a large extent, a well-established idea of selecting medical treatments based on the patient’s genetics
  • Major (though not all) genomic PM applications can be formulated as Machine Learning (ML) problems
  • These problems are commonly considered to be Big Data problems
  • I will argue that they are in fact Small Data problems
  • This has major implications for methods and software used to develop the PM applications and services for clinical use
  • Smallness of data for PM is a fundamental problem, not a technological obstacle

What is Precision Medicine?

PM helps select medical treatments on the basis of certain characteristics of the patient. Of course, medicine has always considered a patient’s personal characteristics before determining treatment; the novelty of PM is that it uses patient characteristics, such as genome data, which have only become available in modern times. A typical scenario is this: you see a doctor for a problem, and she orders a test which “measures” (reads) your genome. The readout is then analyzed by an algorithm, which produces a test report that may say something like: given your genomic profile, you are likely to benefit from Drug D (Fig. 1). Based on this, other information, and her expertise, the physician prescribes a treatment. The idea is that the physician can make a better, more informed choice with the help of the test report. There are lots of variations on this theme, but this is the essence.


Fig. 1. Genome-based personalized medicine.

For completeness: PM algorithms may consider types of information other than genomic. Two prominent examples are: 1) imaging (X-rays, CT scans, retinal scans, etc.) and 2) Electronic Health Records (EHR). For this blog, I will focus on genomic data, for these reasons:

  • As of 2018, it is still the principal approach to PM
  • It is my area of expertise
  • Imaging and EHR-based PM are substantially different problems. The arguments I put forward for genomic PM may or may not extend to these newer areas

Precision Medicine as Machine Learning Problem

In principle, it is clear how PM can be approached as an ML problem. Consider patients suffering from some disease, and split them into two groups: those who have benefited from Drug D, and those who have not (how to define “benefit” is a separate and complicated question… let’s just say there is a binary indicator of benefit called Y). Each patient also has a genomic measurement X of some sort, which can be expressed as a fixed-length numeric vector. Given this setup, one could apply ML to develop a classifier which assigns the patients to the two groups, responders (those who benefit) and non-responders (those who don’t), using X as predictor variables. Once the classifier is tuned to sufficient accuracy, it can be used to estimate the probability of benefit for future patients, based on their genomic readout. The question for this post is whether the problem as described is “Big Data” or not.
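
To make the setup concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a real PM pipeline: the synthetic data (100 patients, 5 genomic features, of which only the first is related to benefit), the choice of plain logistic regression, and the function names train_logistic and predict_proba.

```python
import math
import random


def predict_proba(w, b, x):
    """Estimated probability of benefit (Y = 1) for one genomic vector x."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.1, epochs=500):
    """Fit logistic regression by batch gradient descent; returns (w, b)."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Synthetic cohort (an assumption for illustration): responders have a
# shifted value of feature 0; the other four features are pure noise.
rng = random.Random(0)


def make_patient(responder):
    x = [rng.gauss(0.0, 1.0) for _ in range(5)]
    if responder:
        x[0] += 2.0
    return x


y = [i % 2 for i in range(100)]             # Y: 1 = benefit, 0 = no benefit
X = [make_patient(label == 1) for label in y]  # X: fixed-length numeric vectors

w, b = train_logistic(X, y)
train_acc = sum(
    (predict_proba(w, b, x) >= 0.5) == (label == 1) for x, label in zip(X, y)
) / len(y)

# Probability of benefit for a hypothetical future patient.
prob = predict_proba(w, b, make_patient(True))
print(f"training accuracy: {train_acc:.2f}, P(benefit) for new patient: {prob:.2f}")
```

Note that the accuracy printed here is measured on the training data itself, so it is optimistic; how to estimate performance honestly with few samples is exactly the difficulty discussed later in this post.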

For completeness, a note: there is also a PM approach where X is a scalar, not a vector. In other words, you measure a single genetic characteristic (for example, a mutation or other alteration in a single gene) and derive treatment guidance based on the presence or absence of that feature. This method is particularly widespread in oncology, and does not involve ML since it is typically based on biology, i.e., learning from “first principles”. It also has severe limitations, which I touch upon in the last section. In any case, I do not consider that case here; in our setup X has measurements on multiple genes or other genomic characteristics.

Precision Medicine as Big Data Problem

First, a clarification: I focus on sample size in this post, not number of features. Thus, by Big Data, I mean lots of samples, and conversely, by Small Data, I mean few samples. The reason is that modern machine learning has effective methods of analyzing data with many features, whereas that is not the case for dealing with small sample sizes. In other words, it is sample size which really matters most, and that’s what I discuss here.

With this background, I’ll argue that the type of PM defined earlier is generally considered to be a Big Data problem. The reason is that in recent years there has been extraordinary, exponential growth in the amount of genome (“sequencing”) data generated. Talks in this field almost universally display a graph of the number of genomic sequences or nucleotide bases in public databanks as a function of time, pointing to exponential growth. A typical figure is shown below.

Fig. 2. Cumulative number of human genomes sequenced worldwide. From Stephens et al., Big Data: Astronomical or Genomical? PLOS Biology, 2015.

The immense numbers and growth of the reported sequencing data strongly suggest that any ML application purporting to produce PM algorithms (classifiers) must be dealing with Big Data. If true, that would be a good thing, because given that much data we ought to be able to develop very accurate classifiers, and thereby help many patients.

However, the argument that PM is a Big Data problem has a flaw. First, consider some fairly obvious background.

To derive a genomic classifier for a typical PM application, one has to train ML algorithms using predictor variables X (i.e., the genome data) and a dependent variable Y (i.e., “phenotypic” data). In other words, for each “training set” patient, both X and Y must be available. The Y data depends on the disease of interest. Some types of binary Y information are: who will experience a particular disease outcome (for example, worsening of disease, cancer recurrence, death) in a given timeframe (this is called prognosis); which patients with a particular disease benefit from a given drug (this is called prediction; for example, antibiotic resistance); and which patients have a particular disease or disease type (for example, bacterial vs. viral infection), given a genomic signature found in their body (blood, tumor tissue, nasal swab, etc.; this is called diagnosis).

Given this background, the key problem with the Big Data argument is that the growth of Y data lags far behind the growth of genomic (X) data. You will seldom, if ever, see a chart similar to the one for sequencing data but illustrating the growth of phenotypic data. And yet, it is Y that determines what data is available for training PM algorithms, because, as we’ve seen, we must have both X and Y for each patient. In fact, my experience and a large published literature show that the amount of Y data for almost any specific clinical problem of interest is severely limited. There are many reasons, but a critical one is time: it can take a long time to acquire a sufficient amount of Y (dependent variable) data.

Consider, for example, a hugely successful application of PM in guiding chemotherapy treatment for early-stage breast cancer patients. In this case, a classifier was trained to predict whether the patient’s cancer will recur within 10 years of diagnosis (recurrence). If the cancer is likely to recur, chemotherapy can be given to reduce the chances of that happening. On the other hand, the vast majority of patients do not need the toxic chemo because their cancers are not coming back. Therefore it is of great interest to predict recurrence, and limit chemo to just those patients likely to recur. To train the model to predict recurrence, two sources of data must, as usual, be available:

  • The genomic data X, which is more or less readily available in the case of breast cancer
  • The 10-year recurrence information: Y = 1 if cancer recurred, 0 otherwise. Depending on the type of ML model, time of recurrence is also recorded.

In other words, we must wait a full 10 years for each patient to learn their Y variable (and time, if used). In practice, the wait is longer, because we can’t recruit the full cohort of patients on Day 1. Needless to say, this imposes severe limitations on the development of this type of test, because few businesses (and investors) can afford, or are willing, to wait this long for data. In fact, it is remarkable that this particular application has been developed at all. Worst of all, very little can be done to alleviate this problem: no amount of investment or technological progress can fundamentally alter the fact that we have to wait for outcomes to happen.
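
The arithmetic of the wait can be sketched in a few lines of Python. The 10-year follow-up comes from the example above; the 3-year recruitment period, the cohort of 500 patients, the assumption of uniform enrollment, and both function names are hypothetical, chosen only for illustration.

```python
def years_until_all_outcomes_known(follow_up_years, accrual_years):
    """Years from study start until the last-enrolled patient completes follow-up."""
    return accrual_years + follow_up_years


def patients_with_complete_follow_up(n_patients, follow_up_years,
                                     accrual_years, years_elapsed):
    """Patients whose full follow-up is complete, assuming uniform enrollment."""
    if accrual_years <= 0:
        return n_patients if years_elapsed >= follow_up_years else 0
    fraction = (years_elapsed - follow_up_years) / accrual_years
    fraction = min(1.0, max(0.0, fraction))
    return round(n_patients * fraction)


# Hypothetical study: 500 patients recruited evenly over 3 years,
# each followed for 10 years.
total_wait = years_until_all_outcomes_known(follow_up_years=10, accrual_years=3)
print("years until the full data set is available:", total_wait)  # 13

for year in (10, 11.5, 13):
    n_done = patients_with_complete_follow_up(500, 10, 3, year)
    print(f"year {year}: {n_done} patients with complete 10-year follow-up")
```

Under these assumptions, no patient has a complete 10-year outcome until year 10, and the full cohort is usable only at year 13. Recurrences are observed earlier, but a patient can be labeled Y = 0 only after the entire follow-up period has passed.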

Given this situation, we see that the sample size for genomic PM is truly driven by the amount of outcome (Y) data, not by X. And given the time it takes to acquire typical outcomes, the number of samples available to PM algorithm developers is commonly in the hundreds, if not fewer. I experienced this first hand, having spent 15 years developing genomic PM applications, and being always short on data, without exception. For this reason, typical PM ML applications are in fact Small Data problems. And for this reason, such applications are not very widespread, as evidenced by the fact that there are only a handful of such products in clinical use.

Precision Medicine as Small Data Problem

Based on the above reasoning, I have argued that ML-based genomic PM applications are Small Data problems. This brings up two more questions: 1) What exactly is the “number of samples”? 2) What does “Small” actually mean (i.e., how many samples constitute Small Data)?

Let’s discuss 1) first. This sounds like a very simple question: count the number of patients available in the training/test sets, and there is your sample size. But that ignores the question of prevalence, i.e., unbalanced datasets. In medicine, the sample sizes in the two classes are typically unbalanced, sometimes dramatically. Consider, for example, the STRIVE study by the company GRAIL, which aims to recruit 120,000 women. This is gigantic by any standard. But based on breast cancer incidence data, only about 600 (0.5%) of the women in the study will be diagnosed with cancer during the study. Therefore the effective sample size is closer to 1,200 (600 “positive” and 600 “negative”). Of course, you can use all the 100,000+ negatives, but the only gain from that is a better estimate of specificity (at the cost of slower learning). To fully characterize the test, one needs sensitivity, which is driven by the number of “positive” patients, which is just around 600. In other words, the effective sample size is driven by the cardinality of the smallest class, not the total.
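
The calculation above can be written out as a short Python sketch. The 120,000 participants and 0.5% incidence are the figures quoted in the text; the sensitivity of 0.80 and specificity of 0.99 are made-up values used only to compare the precision of the two estimates, and the function names are hypothetical.

```python
import math


def effective_sample_size(n_total, prevalence):
    """Return (positives, negatives, balanced effective size = 2 * smaller class)."""
    n_pos = round(n_total * prevalence)
    n_neg = n_total - n_pos
    return n_pos, n_neg, 2 * min(n_pos, n_neg)


def standard_error(proportion, n):
    """Binomial standard error of a proportion estimated from n samples."""
    return math.sqrt(proportion * (1.0 - proportion) / n)


n_pos, n_neg, n_eff = effective_sample_size(120_000, 0.005)
print(n_pos, n_neg, n_eff)  # 600 119400 1200

# Hypothetical test performance, for illustration only.
se_sensitivity = standard_error(0.80, n_pos)  # depends on the 600 positives
se_specificity = standard_error(0.99, n_neg)  # depends on the 119,400 negatives
print(f"SE of sensitivity: {se_sensitivity:.4f}")  # about 0.016
print(f"SE of specificity: {se_specificity:.4f}")  # about 0.0003
```

With these assumed values, the specificity estimate is far more precise than the sensitivity estimate, even though both come from the same 120,000-person study. The uncertainty that matters is set by the smaller class.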

Back to the second question: what is the threshold for Big Data? It is, of course, a judgment call, but my view is that a couple of hundred samples per class does not qualify as Big Data. I’ll also say that the small sample sizes are a fundamental limitation due to time-to-outcome and epidemiology (prevalence of diseases), not a technological obstacle.

Methods and Software

Approaching PM as a Small Data problem has significant implications for machine learning methods and software. That topic could easily fill another blog, or even a book. In my view, a key implication is that the common split of data into training, validation and test sets is problematic in this context. The reason is that if you have, say, 100 samples per class, and split them, for example, 60/20/20, then the performance estimates obtained on the 20 validation/test samples are hugely variable and certainly not sufficiently robust to develop diagnostic products for use on patients. Another challenge: if a small validation set is used repeatedly for model tuning, it becomes, in effect, a training set, because the model becomes tuned to the validation set, not to future populations of patients. With thousands or millions of samples per class available, this would be basically a non-issue; but with low hundreds, or fewer, it becomes a major problem.
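
To get a feel for how variable an estimate from 20 samples is, one can compute a confidence interval for an observed proportion. The sketch below uses the Wilson score interval, which is a standard choice for small samples but is my choice for illustration, not something prescribed by this post. The observed counts (16 correct out of 20, and 1,600 out of 2,000) are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2.0 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return center - half_width, center + half_width


# Same observed sensitivity of 80%, very different certainty.
small_lo, small_hi = wilson_interval(16, 20)       # 20 positives in the test set
large_lo, large_hi = wilson_interval(1600, 2000)   # 2,000 positives

print(f"n=20:   {small_lo:.2f} to {small_hi:.2f}")   # roughly 0.58 to 0.92
print(f"n=2000: {large_lo:.2f} to {large_hi:.2f}")   # roughly 0.78 to 0.82
```

With 20 samples, an observed sensitivity of 80% is compatible with a true value anywhere from under 60% to over 90%, which is too wide a range to support a clinical claim.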

My approach is to use elaborate cross-validation techniques, but that is certainly not the only way. I’ll just add that, in general, I have not found adequate support for small sample analyses in leading machine learning frameworks, and have invariably had to develop in-house solutions for my work.
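
As one example of what a cross-validation scheme for small samples can look like, here is a sketch of repeated stratified k-fold splitting in plain Python. This is a common generic technique, shown as an assumption for illustration; it is not a description of the in-house solutions mentioned above, and the function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(y, n_splits=5, n_repeats=10, seed=0):
    """Yield (repeat, train_idx, test_idx), keeping class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    for rep in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        # Shuffle each class separately, then deal its samples across folds.
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield rep, train_idx, test_idx


# 100 samples per class, as in the example above.
y = [1] * 100 + [0] * 100
splits = list(repeated_stratified_kfold(y, n_splits=5, n_repeats=10))
print(len(splits))  # 50 train/test splits
```

In each repeat, every sample is used for testing exactly once, so all 200 samples contribute to the performance estimate instead of a fixed 20 per class, and averaging over repeats reduces the dependence on any one random split. If model tuning is also done by cross-validation, it has to happen inside each training portion (nested cross-validation) to avoid the validation-set overfitting described above.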

PAQ (Potentially Asked Questions)

This section is similar to FAQ, except that no one actually asked me these questions, so I can’t call them “Frequently Asked”. But I expect they might be asked.

Isn’t all this obvious? Doesn’t everyone understand the need to have outcome data?

You would think. Despite that, to my knowledge genomic PM has not been commonly characterized as Small Data in publications, talks, blogs, etc. In particular, I haven’t heard much discussion in the public sphere around the key problem of time-to-outcome. Of course, plenty of my colleagues who work in the field are fully aware of these limitations, but I thought it’d be helpful to bring them up in an accessible blog format.

What About Other Applications, Like NIPT and Immunotherapy? Don’t they prove you can practice Precision Medicine without ML?

Yes, PM using genomic data can be, and is being, practiced without ML. NIPT (Non-invasive Prenatal Testing) is a major success story. But that is an exception; otherwise, the record is mixed at best.

For example, in cancer care the approach is this: consider drugs which target particular mutations. Since we know the target, we only give the drugs to people with those genomic variations. Hence there is no need for any ML, and the Big Data/Small Data argument is moot.

In principle, this is a very elegant idea because it is much simpler than ML. But so far, the record isn’t stellar (see, for example, an excellent podcast, Episode 1.20). I don’t find that surprising, because it seems implausible that the treatment response of a human being, composed of trillions of cells and countless interactions, would be accurately predicted by a single genomic feature.

Bottom line: single gene methods work for NIPT, but otherwise to a very limited extent. For truly precise targeting, it seems necessary to learn from data – at least until we master biology such that we can accurately predict outcomes from first principles. Not an easy trade-off, but that’s where we are.

What about imaging? Isn’t there plenty of phenotype data there?

Possibly. I focused on genomics in this blog, but a couple of points regarding imaging: there are indeed plenty of annotated images available in public databases, various ML challenges and elsewhere. The crucial question is what is considered an “outcome”. If the outcome is an expert-annotated feature of an image (for example, “opacity” in X-ray images), then there is a large amount of data, and it is available very soon after the image is taken. In that case the key argument (that it takes a long time to acquire phenotypic data for PM) does not apply. Hence, those applications may indeed take advantage of Big Data (and some have; we should see in the near future if they are adopted by physicians). However, if the outcome is a patient-centered feature, like onset of symptoms, or a clinical event (stroke, cancer, confirmed diagnosis of Alzheimer’s, etc.), then the phenotype takes a long time to observe, and the available numbers become dramatically smaller. Bottom line: I don’t know if there is a clear answer on this, but we may know more relatively soon.

Isn’t there some health data available for people whose genomes are in public databanks? That’s a lot of people – sounds like Big Data

There is indeed some phenotype data for almost all samples in the large databases. But to develop a clinical PM test, you need exact outcomes for a particular condition, not generic health data. For example, for those working on cardiovascular diseases, the large collection of tumor genomes+diagnoses in TCGA (The Cancer Genome Atlas) isn’t too helpful, and vice versa. The numbers rapidly become small as soon as you focus on a specific disease.

What about EHRs? Isn’t that truly Big Data PM, with many patients?

Quite possibly in the near or medium-term future, but at present it seems too early to tell: “Can you really tell how good a new medicine is from Flatiron’s data? Even Nat and Zach admit that the jury is still out.”

Why are people collecting all these genomes? Isn’t that a waste of time, if there are no phenotypes?

Certainly not a waste of time: these genomes are crucial for research in biology, drug discovery, genome association studies, etc. But most of the genomic Big Data, I am arguing, isn’t directly applicable to genome-driven patient care (i.e., PM) due to the lack of usable outcomes.

How about unsupervised learning? It doesn’t need phenotypes

That is very far from clinical applications. Who knows what will happen in the far future, but for the time being, there is no unsupervised PM.


Ljubomir Buturovic

Ljubomir Buturovic, PhD, is Vice President of Informatics at Inflammatix Inc., where his team leverages novel data science, machine learning and artificial intelligence to develop and validate the robust clinical algorithms that power the company’s infectious diseases diagnostics. He previously served as Chief Scientist at Pathwork Diagnostics, Inc., where he led predictive algorithm development for two FDA-cleared genomics tests for cancer. Prior to that, he was Bioinformatics Director at Incyte Corporation. He has served as Adjunct Professor of Computer Science at San Francisco State University since 2005. Dr. Buturovic received his PhD at the School of Electrical Engineering, University of Belgrade, Serbia, and did postdoctoral training at Boston University’s BioMolecular Engineering Research Center.