Abstract Title: Machine Learning to Program Living Cells
Abstract Summary: Over the past two decades, the ability to engineer increasingly complex genetic circuits, sequences of DNA that encode computational operations in living cells, has advanced rapidly. Progress has resulted from several factors including faster and cheaper DNA sequencing, an improved understanding of cell biophysics and the ability to make targeted genomic modifications using CRISPR.
Yet despite our progress, biological engineers often spend years creating a single functional design through manual trial-and-error. Drawing inspiration from machine learning and digital logic synthesis, we built a genetic circuit design automation platform, Cello. Cello has aided in the design of some of the most complex genetic circuitry to date.
HA) Tell us something about yourself and your work.
JI) Hey everyone, my name is Joe. I manage the software engineering organization at Asimov, a small startup in Cambridge, MA focused on programming living cells to unlock previously impossible biotechnologies. Specifically, we build genetic circuits – networks of DNA-encoded genes that dynamically sense and respond to their environment – to design and manufacture next generation therapeutics.
Software engineering at Asimov helps achieve this mission by developing tools to accelerate the research and development lifecycle of biological engineering. We work across four areas: robotic automation, data science, infrastructure and machine learning. Together, these teams build tools similar in concept to those found in aeronautical or civil engineering: an advanced computer-aided design (CAD) platform for synthetic biology. This platform helps biologists design genetic circuits, translate these designs into physical DNA, measure the effects of circuits on a target organism, and automatically learn to improve designs from previous experiments.
Personally, I enjoy solving engineering problems at the intersection of machine learning, software systems, data infrastructure and biology. On the technical side, I’ve spent some time building products across computer vision, forensics, search ranking, recommendation systems and ad targeting. Across all of these applications, I have always enjoyed seeing the impact a software system has on a company or on a customer. The past several years I’ve been working as an engineering manager, where I’ve grown a passion for helping the teams I work with achieve tremendous goals while minimizing unnecessary process.
HA) What inspired you to pursue machine learning and genetics?
JI) Machine learning is mostly an accident. A majority of my family are professors, largely in applied mathematics, with some physics and biology scattered about. You could say my passion for applied math is … genetic! Entering college I wanted to study multiple facets of mathematics and its intersection with computer science. My favorite subject was numerical programming so I wanted to find a job where I could use this skill. My first full-time position out of school was at MIT’s Lincoln Laboratory where I was assigned to a computer vision / object recognition project. This was well before deep learning ate the world; I remember sitting down, pulling up papers and books on computer vision, and coming across references to methods such as support vector machines and random forests. It was awesome when I realized that there was a whole field of study, *machine learning*, and that many of the techniques I learned in numerical linear algebra were transferable!
To the genetics side, it again stems from family. My mother was diagnosed with breast cancer when I was a child. At the time, I watched my father (an applied math professor) pivot his research towards understanding the mechanisms of cancer biology and methods for early screening and detection. While I was too young to understand the details of his research, it left a strong impression; I knew I would want to contribute one day as well. In my career I’ve had a few chances to work on biological applications, but most recently Asimov has afforded me the opportunity to build an organization focused on using machine learning and software tools to impact R&D in biology.
HA) How do you combine molecular dynamics simulations with machine learning when designing genetic circuitry?
JI) Our platform does not currently use molecular dynamics simulation for designing genetic circuits. We use machine learning to solve two fundamental problems: 1. Given a genetic circuit, can we predict its effect on a host organism? 2. Given a desired behavior and access to a library of DNA “parts”, can we predict the ideal composition of parts (aka a circuit) to maximize the likelihood that the circuit will produce the desired behavior?
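As a purely illustrative sketch (not Asimov's actual models or data), the two problems above can be framed as a forward predictor and a search over part combinations. The part names, their "strength" scores, and the functions `predict_output` and `design_circuit` are all hypothetical, and the toy objective picks the design whose predicted output is closest to a target rather than maximizing a likelihood.

```python
from itertools import product
from typing import Tuple

# Hypothetical part library: names and strength scores are made up.
PROMOTERS = {"pA": 0.2, "pB": 0.6, "pC": 0.9}
RBS_SITES = {"r1": 0.5, "r2": 1.0}


def predict_output(promoter: str, rbs: str) -> float:
    """Problem 1 (forward): predict a circuit's output from its parts.

    Toy stand-in for a learned model: the product of part strengths.
    """
    return PROMOTERS[promoter] * RBS_SITES[rbs]


def design_circuit(target: float) -> Tuple[str, str]:
    """Problem 2 (inverse): choose the part composition whose predicted
    output is closest to the desired target, by exhaustive search."""
    candidates = product(PROMOTERS, RBS_SITES)
    return min(candidates, key=lambda c: abs(predict_output(*c) - target))


best = design_circuit(0.3)
```

With a real library the candidate space grows combinatorially, so exhaustive search would likely give way to a guided search; the sketch only shows how the forward model sits inside the design loop.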
Given these objectives, we can collect data about the performance of parts and circuits; this is similar to collecting machine learning features. Specifically, we capture data at genomic (DNA), transcriptomic (RNA) and proteomic (protein) resolutions; this gives us high resolution information about the effect of our circuit components on host organisms.
Similar to testing a new recommendation system or ad targeting system, we optimize learners offline on a training set, backtest, and, if offline metrics look good, push the model to production – that is, use the model to design new circuits and test them inside living cells in the lab.
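The offline step described above can be sketched as a chronological backtest: fit on earlier experiments, score on later ones, and only promote the model if the held-out error is acceptable. This is a minimal sketch under assumptions; the experiment records, the single feature, the linear model, and the `0.5` error threshold are all hypothetical.

```python
from statistics import mean

# Hypothetical records: (experiment_day, feature, measured_output).
EXPERIMENTS = [
    (1, 1.0, 2.1), (2, 2.0, 3.9), (3, 3.0, 6.2),
    (4, 4.0, 8.1), (5, 5.0, 9.8), (6, 6.0, 12.1),
]


def fit_line(points):
    """Ordinary least squares fit of y = a * x + b on (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    a = sum((x - mx) * (y - my) for x, y in points) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


def backtest(records, cutoff_day):
    """Train on experiments up to cutoff_day, then return the mean
    absolute error on the experiments that came after it."""
    train = [(x, y) for day, x, y in records if day <= cutoff_day]
    test = [(x, y) for day, x, y in records if day > cutoff_day]
    a, b = fit_line(train)
    return mean(abs(a * x + b - y) for x, y in test)


mae = backtest(EXPERIMENTS, cutoff_day=4)
ready_for_lab = mae < 0.5  # hypothetical promotion threshold
```

Splitting by time rather than at random mirrors how the model would actually be used: it only ever sees past experiments when proposing new designs.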
HA) FDA approval and arduous clinical trials are probably what make biotech products reach the market more slowly than general tech products. What additional challenges do you face specifically in the realm of biotech?
JI) A/B testing a new recommendation algorithm is certainly faster than spinning up a clinical trial! But this is necessary; the cost of a bad recommendation algorithm is minor compared to the cost of releasing a toxic drug. Especially in the field of machine learning, there are a couple of core challenges that rate-limit fast iteration:
The first is dataset generation. Relative to tech products, biotech data generation is high cost, high latency and low throughput. While all three axes are improving exponentially, in absolute terms data collection is not yet at the scale of products like advertisement clicks. Any one of our experiments generates large amounts of raw data (100s of GBs across DNA/RNA/protein quantification), but the number of experiments that can be run in parallel is comparatively small. This forces us to be creative in our machine learning methodology, especially in how we fuse real world data with simulations to enhance the accuracy of models. I would draw a parallel to Waymo’s self-driving car simulator: to increase the number and variance of test drives, careful simulations are run to supplement real world drives.
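The interview does not say how real and simulated data are fused, so the following is only one possible illustration: a weighted least squares fit in which simulated samples count for less than lab measurements. The data points, the `SIM_WEIGHT` value, and the through-the-origin model are all assumptions made for the sketch.

```python
# Hypothetical (x, y) measurements from the lab and from a simulator.
REAL = [(1.0, 2.1), (2.0, 3.9)]
SIMULATED = [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]

SIM_WEIGHT = 0.2  # assumed: trust a simulated point 1/5 as much as a real one


def weighted_slope(samples):
    """Weighted least squares for y = k * x; samples are (x, y, weight)."""
    numerator = sum(w * x * y for x, y, w in samples)
    denominator = sum(w * x * x for x, y, w in samples)
    return numerator / denominator


# Tag each sample with a weight that reflects its source.
samples = [(x, y, 1.0) for x, y in REAL] + [(x, y, SIM_WEIGHT) for x, y in SIMULATED]

k_real_only = weighted_slope([(x, y, 1.0) for x, y in REAL])
k_fused = weighted_slope(samples)
```

Down-weighting lets plentiful simulated points widen coverage of the input space without drowning out the scarce lab measurements; the right weight would have to be tuned against held-out real data.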
A second challenge is the speed of “online testing”. Building “offline” models is as fast for us as any company: we pull data, train models and test models. But testing a new model “online” in our context requires generating many circuit designs from a test model, physically building predicted circuit DNA, inserting this DNA into a host organism, growing the organism and then measuring some output target (e.g., protein level). This process can take several months.
HA) What is your vision for launching intelligent therapeutic products for the next 5 years? 20 years?
JI) Over the next couple decades we will continue to see a shift in the pharmaceutical industry as companies grow their R&D pipelines from exclusively small, organic molecule synthesis and screening to invest in more biologics, cell therapies and gene therapies. Even today, several of the globally best selling biotechnology drugs are monoclonal antibodies, a class of biologic. In some sense this demonstrates an initial launch of intelligent therapeutics: such treatments are engineered to be highly specific (efficacious with few side effects).
And as biological engineering evolves to more quickly design, screen and manufacture biologically inspired treatments we will unlock more intelligent therapeutics. In one approach, we will be able to harness genetic circuits to continually sense your body’s environment, to screen and monitor for abnormalities and, upon detecting them, to interface with endogenous pathways to eliminate the issue.
HA) Why are genetic circuits not being widely harnessed in biotechnology today?
JI) Genetic circuits are ubiquitous throughout nature. As we describe in our blog,
“In a lone bacterium as it ‘tumbles and runs’ toward food. In a California redwood as it constructs itself into the sky. And in your immune system as it wards off cancer and infection. In fact, every single thing that civilization sources from biology – food, materials, drugs – was built by nature using genetic circuits to exert fine spatiotemporal control over biochemistry.”
However, as you mention, circuits are not harnessed widely in biotechnology companies today. Designing even a single, functional circuit is a complex process involving discovery of genetic parts, measuring effects on host organisms, eliminating toxicity and minimizing off-target effects. Until recently, circuit discoveries were manual, time-consuming and prone to error, built one-off in laboratories. It is through platforms like Asimov that we are just now able to computationally reduce the search space for functional designs. With such successes, we will certainly see the industry widely adopt genetic circuits as tools for manufacturing. It is akin to watching the aerospace industry shift from manually designed gliders flown at Kitty Hawk to fully automated software tools for designing commercial jets and rockets.
HA) What is the status quo in our understanding of genetic circuits; what percentage of the patterns that exist in genetic circuits do we understand? Have you considered the ethical implications of decoding genetic circuits?
JI) We have a lot to learn about the full functionality and power of genetic circuits! The behaviors of simple circuits in a few target organisms (e.g., the lac operon in E. coli) are well characterized, but we have a lot to learn before we can achieve high-accuracy human cell engineering.
To the question of ethics, it’s a matter we take seriously. Engineering at a population level, be it bacteria or humans, can have vast, unintended consequences. Even at our early stage, we have sponsored efforts such as GP-write – a global consortium of leading biologists focused on learning how to efficiently and safely “write” DNA. This consortium has a dedicated group of experts who study, debate and develop ethical guidelines for synthetic biology labs globally. We are proud to work with such organizations to ensure we develop biotechnologies that help society by addressing global challenges.
HA) Who are your competitors? Do you collaborate with hospitals, research labs and other biotech companies to gather data for your CELLO platform?
JI) The CELLO platform was developed through a partnership between MIT and Boston University. We continue to collaborate closely with both. In addition, we’ve had fruitful partnerships with The Broad Institute.
At this time, there are no direct competitors in our space.
HA) Which journal papers and industry conferences do you follow?
JI) There are many! We follow journals such as Nature, Science and (of course!) bioRxiv, as well as conferences such as ICML, ICLR, NIPS, IWBDA, SynBioBeta, Open Data Science Conference and, of course, MLconf.
Bio: Joe is the VP of Engineering at Asimov, a startup with the mission to program living cells. We leverage techniques from synthetic biology, systems engineering and machine learning to automate the compilation of genetic circuits. Prior to Asimov, Joe led machine learning teams at Quora and at URX (Y Combinator, acquired by Pinterest). In these roles, Joe helped to build recommendation systems to personalize content discovery, algorithms to optimize advertisement targeting and machine learning models for search and discovery. Prior to joining the startup world, Joe contributed to research at MIT Lincoln Laboratory and Brown University, developing numerical methods for solving problems across forensics, computer vision and molecular dynamics.
Bio: Himani Agrawal, PhD is a Data Scientist at AT&T Chief Data Office. She has a very interdisciplinary research background encompassing fields such as applied mathematics, biophysics and data science. Apart from work, she is passionate about promoting women in technology and actively participates with the Anita Borg Institute, Women in Machine Learning and Data Science, Women in Machine Learning and Society of Women Engineers. She is actively involved in the machine learning and data science communities at AT&T: Data Night Live and Data Powered Insights. She has had speaking engagements at international conferences like Grace Hopper Celebration of Women in Computing, Society of Women Engineers Annual Conference, American Society of Mechanical Engineers and Society of Engineering Science Annual Technical Meeting.
She is passionate about Italian Classical singing and currently trains with University of North Texas Voice Professor Dr. Lauren McNeese.