One of our Program Committee members, Sarah Braden, recently interviewed Adam Omidpanah, a biostatistician at Washington State University. The interview covers how recent advances in machine learning have affected research in health care delivery.
SB) Tell us briefly about yourself and your work.
AO) I am a biostatistician with Washington State University’s College of Nursing. I work with a variety of junior collaborators, helping them with study design, data analysis, and manuscript preparation. Our work mainly focuses on health care delivery and disparities in marginalized US populations, particularly American Indians and Alaska Natives. One of the greatest challenges in my work is shaping a complex causal model that incorporates psychosocial medicine with overt medical conditions, such as the relation between sleep and depression or between PTSD and diabetes complications.
SB) This past month Google released a paper on arXiv (https://arxiv.org/abs/1703.02442) demonstrating a convolutional neural network (CNN) model that has better small tumor detection rates than human pathologists. The context for this work was improving breast cancer metastasis detection in lymph nodes. With better metastasis detection, more appropriate patient treatment plans could be chosen, potentially improving patient outcomes. How are other advances in Machine Learning improving the accuracy of cancer diagnoses and risk prediction?
AO) I think ROC regression is a promising area of research. ROC regression combines covariates to directly maximize an ROC curve, although its operating characteristics are very irregular. I think boosting and bagging are also promising, and it is only a matter of time before a diagnostic tool is developed that requires knowledge about several hundred genes simultaneously. I am also happy that risk stratification table indices, like the integrated discrimination index, have recently fallen out of favor. I always found their performance to be overly sensitive to assumptions, and recent work from Dr. Katie Kerr and Dr. Margaret Pepe has shown this (and other issues) to be a cause for concern.
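As a rough illustration of what "combining covariates to directly maximize an ROC curve" can mean, here is a toy sketch. It is not the interviewee's method. It assumes two simulated covariates, takes the empirical area under the ROC curve (AUC) as the objective, and searches a grid of directions for the best linear combination. The function names `empirical_auc` and `best_direction` and all of the data are hypothetical.

```python
import math
import random

def empirical_auc(scores, labels):
    """Mann-Whitney estimate of P(case score > control score); ties count 1/2."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for c in cases:
        for d in controls:
            if c > d:
                wins += 1.0
            elif c == d:
                wins += 0.5
    return wins / (len(cases) * len(controls))

def best_direction(x1, x2, labels, n_angles=180):
    """Grid-search theta so that cos(theta)*x1 + sin(theta)*x2 has maximal AUC."""
    best_auc, best_theta = -1.0, 0.0
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        scores = [math.cos(theta) * a + math.sin(theta) * b
                  for a, b in zip(x1, x2)]
        auc = empirical_auc(scores, labels)
        if auc > best_auc:
            best_auc, best_theta = auc, theta
    return best_auc, best_theta

# Simulated data (an assumption for illustration): both covariates are
# shifted upward by 0.8 standard deviations in cases (label 1).
random.seed(0)
n = 200
labels = [i % 2 for i in range(n)]
x1 = [random.gauss(0.8 * y, 1.0) for y in labels]
x2 = [random.gauss(0.8 * y, 1.0) for y in labels]

auc_x1 = empirical_auc(x1, labels)            # first covariate alone
auc_combined, theta = best_direction(x1, x2, labels)
print(f"AUC x1 alone: {auc_x1:.3f}, best combination: {auc_combined:.3f}")
```

The empirical AUC is a step function of the coefficients, so gradient-based fitting does not apply directly. That is why this sketch falls back on a grid search, and it is one concrete sense in which the objective is irregular.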
SB) What are the barriers to introducing machine learning into clinical practice? Is there a protocol for using machine learning models in the field of oncology?
AO) Randomized clinical trials are the gold standard for demonstrating the effectiveness of any new cancer detection or treatment tool. Any machinery involving ML algorithms should be subjected to the same rigorous testing, and also confirmed in secondary trials to reduce false positive findings. In that sense I don’t see these as barriers. The clinical perception of these tools is very positive: generally, there is a desire to reduce human error. The protocol for using machine learning models is pragmatic.
SB) What datasets are openly available for researchers to work on cancer risk prediction?
AO) SEER is perhaps the most widely known. The Centers for Medicare & Medicaid Services (CMS) has recently merged SEER data with Medicare data and provided a 10% subsample of the US population without cancer. However, I would encourage interested investigators to research the National Cancer Institute (NCI)’s many accessible databases. Open data are useful, but closed data aren’t inaccessible.
SB) How do concerns about data privacy affect Machine Learning research in oncology?
AO) Most concerns relate specifically to the technology: genetics in particular has been a problematic area of research. Biospecimen donation rates are relatively low, especially for racial and ethnic minorities. And yet, these are the people often most adversely affected by cancer. Promoting biospecimen donation among racial and ethnic minorities could improve precision medicine and reduce disparities.
SB) Clinical science datasets often have large p, small n problems where the number of predictor variables (or features) is larger than the number of observations. Missing data and unbalanced classes are other common issues. What techniques do you use to overcome these challenges in your own work?
AO) An unpopular approach to p >> n is to simply refine the scope of the problem. Very rarely do I encounter a dataset where all features can be assigned equal weight. Combining relevant features in practical ways, and excluding others that have no relation to a pre-specified scientific question, usually points clearly to a way forward. Missing data and unbalanced classes, while unrelated problems, both have methods that apply high dimensional prediction algorithms to either impute or propensity match, thus improving the performance of regression models using those data. Recently I’ve had some success estimating such a high dimensional prediction algorithm using, simply, log linear models with splines and BIC for model selection.
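To make the last point concrete, here is a minimal sketch of choosing spline complexity in a log linear model by BIC. It is an illustration under stated assumptions, not the interviewee's code. It assumes simulated count data, a Poisson model with a log link, a linear truncated-power spline basis with equally spaced knots, and Newton's method for fitting. All names (`spline_row`, `fit_poisson`, the candidate knot counts) are hypothetical. In practice a statistics library would replace the hand-written fitting routine.

```python
import math
import random

def spline_row(x, knots):
    """Linear truncated-power basis: 1, x, and (x - k)+ for each knot k."""
    return [1.0, x] + [max(0.0, x - k) for k in knots]

def solve(a, b):
    """Solve the linear system a @ beta = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = s / m[i][i]
    return beta

def fit_poisson(rows, y, iters=25):
    """Fit log E[y] = rows @ beta by Newton's method; return (beta, log-likelihood)."""
    p = len(rows[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)  # start at the null model
    for _ in range(iters):
        mu = [math.exp(sum(b * v for b, v in zip(beta, r))) for r in rows]
        # Score vector X'(y - mu) and information matrix X' diag(mu) X.
        score = [sum(r[j] * (yi - mi) for r, yi, mi in zip(rows, y, mu))
                 for j in range(p)]
        info = [[sum(r[j] * r[k] * mi for r, mi in zip(rows, mu))
                 for k in range(p)] for j in range(p)]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    eta = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    loglik = sum(yi * e - math.exp(e) - math.lgamma(yi + 1)
                 for yi, e in zip(y, eta))
    return beta, loglik

def rpois(lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

# Simulated data (an assumption): counts whose log-mean is a smooth curve in x.
random.seed(1)
xs = [random.random() for _ in range(400)]
y = [rpois(math.exp(1.0 + 0.8 * math.sin(2 * math.pi * x))) for x in xs]

bic = {}
for n_knots in (0, 1, 2, 3, 5):
    knots = [(j + 1) / (n_knots + 1) for j in range(n_knots)]
    rows = [spline_row(x, knots) for x in xs]
    _, loglik = fit_poisson(rows, y)
    # BIC = (number of parameters) * log(n) - 2 * log-likelihood; lower is better.
    bic[n_knots] = len(rows[0]) * math.log(len(y)) - 2.0 * loglik

best_knots = min(bic, key=bic.get)
print("BIC by number of knots:", {k: round(v, 1) for k, v in bic.items()})
print("Selected number of knots:", best_knots)
```

BIC penalizes each additional spline term by log(n), so extra knots are kept only when they improve the likelihood by more than that amount. This keeps the selected model small relative to the number of candidate terms.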
Adam Omidpanah is a biostatistician whose interest in machine learning began during his graduate studies. Adam holds a Bachelor of Science in Mathematics from Portland State University and a Master of Science in Biostatistics from the University of Washington.
Sarah Braden is currently a Data Scientist at the startup HireIQ Solutions, Inc. There she specializes in developing predictive models for HireIQ’s automated interviewing platform. She also writes tools for HireIQ using automated speech recognition. Sarah is a fan of open source technology. She holds a PhD in Geological Sciences from the School of Earth and Space Exploration at Arizona State University and a Bachelor’s in Physics from Northwestern University.