GENES CARRY THE information that makes you you. So it’s fitting that, when sequenced and stored in a computer, your genome takes up gobs of memory: up to 150 gigabytes. Multiply that across all the people who have gotten sequenced, and you’re looking at some serious storage issues. As if that weren’t enough, mining those genomes for useful insight means comparing them all to each other, to medical histories, and to the millions of scientific papers about genetics.
Sorting all that out is a perfect task for artificial intelligence. And plenty of AI startups have bent their efforts in that direction. On August 3, sequencing company Veritas Genetics bought one of the most influential: seven-year-old Curoverse. Veritas thinks AI will help interpret the genetic risk of certain diseases and scour the ever-growing databases of genomic, medical, and scientific research. In a step forward, the company also hopes to use things like natural language processing and deep learning to help customers query their genetic data on demand.
It’s not totally surprising that Veritas bought up Curoverse. Both companies spun out of George Church’s prolific Harvard lab. Several years ago, Church started something called the Personal Genome Project, with the goal of sequencing 100,000 human genomes and linking each one to its participant’s health information. Veritas’ founders helped lead the sequencing part; the company started as a prenatal testing service and launched a $1,000 full genome product in 2015. Curoverse, meanwhile, worked on academic strategies to store and sort through all the data.
But more broadly, genomics and AI practically call out for one another. In raw form, a single person’s genome takes up about 150 gigabytes. How!?! OK so, yes, storing a single base pair only takes up around two bits. Multiply that by roughly 3 billion (the number of base pairs in one full set of your 23 chromosomes) and you wind up with around 750 megabytes. But genetic sequencing isn’t perfect. Mirza Cifric, Veritas Genetics’ cofounder and CEO, says his company reads each part of the genome at least 30 times in order to make sure its results are statistically significant. “And you gotta keep all that data, so you can refer back to it over time,” says Cifric.
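For the curious, here is that back-of-the-envelope math as a short Python sketch. The two bits per base pair, the 3 billion base pairs, and the 30 reads are the figures quoted above. The two bytes per sequenced base (one for the letter, one for a quality score, as is typical of raw read files) is our own assumption, not a number from Veritas, and it ignores compression and read labels.

```python
# Rough genome storage arithmetic. Figures are from the article unless
# marked as an assumption.

BASE_PAIRS = 3_000_000_000  # roughly 3 billion base pairs
BITS_PER_BASE = 2           # four letters (A, C, G, T) fit in 2 bits
COVERAGE = 30               # each position is read at least 30 times


def packed_genome_bytes(base_pairs=BASE_PAIRS, bits_per_base=BITS_PER_BASE):
    """Bytes needed to store one tightly packed copy of the sequence."""
    return base_pairs * bits_per_base // 8


def raw_read_bytes(base_pairs=BASE_PAIRS, coverage=COVERAGE, bytes_per_base=2):
    """Bytes needed to keep every read in an uncompressed text format.

    ASSUMPTION: each sequenced base costs 2 bytes, one for the letter and
    one for its quality score. Real files vary.
    """
    return base_pairs * coverage * bytes_per_base


if __name__ == "__main__":
    packed = packed_genome_bytes()
    print(f"One packed copy:       {packed / 1e6:,.0f} MB")
    print(f"30 packed copies:      {packed * COVERAGE / 1e9:,.1f} GB")
    print(f"30x raw reads (est.):  {raw_read_bytes() / 1e9:,.0f} GB")
```

Under these assumptions, one packed copy comes to 750 megabytes, 30 packed copies to 22.5 gigabytes, and the raw reads to about 180 gigabytes. That last estimate lands in the same ballpark as the 150 gigabytes quoted above, which suggests most of the bulk comes from keeping every read and its quality data rather than from the sequence itself.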
That’s just storage. “Everything after that is going to specific areas and asking questions: There’s a variant at this location, a substitution of this base, a deletion here, or multiple copies of this same gene here, here, and here,” says Cifric. Now, interpret all that. Oh, and do it across a thousand, a hundred thousand, or a million genomes. Querying all those genetic variations is how scientists get leads to find new drugs, or figure out how existing drugs work differently on different people.
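To make that kind of querying concrete, here is a toy Python sketch that tallies how often each variant shows up across a set of genomes. Everything in it is hypothetical: the `Variant` record, the `variant_frequencies` function, and the sample data are made up for illustration and do not reflect how Veritas or Curoverse actually store or search genomes.

```python
from collections import Counter
from typing import NamedTuple


class Variant(NamedTuple):
    """A hypothetical, simplified record of one difference from a reference."""
    chromosome: str
    position: int
    kind: str    # "substitution", "deletion", or "duplication"
    detail: str  # e.g. "A>G" for a substitution


def variant_frequencies(genomes):
    """Return the fraction of genomes that carry each variant.

    `genomes` maps a person ID to the list of variants found in that person.
    """
    counts = Counter()
    for variants in genomes.values():
        counts.update(set(variants))  # count each variant once per person
    total = len(genomes)
    return {variant: n / total for variant, n in counts.items()}


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    snp = Variant("chr1", 12345, "substitution", "A>G")
    gap = Variant("chr7", 67890, "deletion", "3 bases")
    genomes = {
        "person_a": [snp, gap],
        "person_b": [snp],
        "person_c": [],
    }
    for variant, share in variant_frequencies(genomes).items():
        print(f"{variant.chromosome}:{variant.position} {variant.kind} "
              f"found in {share:.0%} of genomes")
```

With three people this runs instantly. The hard part Cifric describes is doing the same kind of lookup across a million genomes, then tying each hit to medical histories.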
But cross-referencing all those genomes is just the beginning. Curoverse, which was focusing on projects to store and sort genomic data, also has its work cut out for it in searching through the 6 million—and counting—jargon-filled academic papers detailing gene behavior, including visual information found in charts, graphs, and illustrations.
That’s pretty ambitious. Natural language processing is one of the stickiest problems in AI. “Look, I am a computer scientist, I love AI and machine learning, and no amount of coding makes sense to solve this,” says Atul Butte, the director of UCSF’s Institute for Computational Health Sciences. At his former job at Stanford University, Butte actually tried to do the same thing: use AI to dig through genetics research. He says that, in the end, it was way cheaper to hire people to read the papers and input the findings into his database manually.