Artificial Human Genomes Could Help Overcome Research Privacy Concerns

Computers are taking steps towards creating novel human genomes.

Photo: Mario Tama/Getty Images

It can be difficult to distinguish a flesh-and-blood human’s face from one generated by artificial intelligence. (Telltale signs are misshapen eyes under spectacles, non-dermatological blemishes on the skin, and hair that looks as if it’s been thatched atop the head, though your experience may differ.) But what about when those impostors are more than skin deep? What if the computer-generated humans are described on a genetic level?

A team of geneticists and computer scientists has been using neural networks to construct novel segments of human genomes, according to a paper published in the journal PLOS Genetics. Their work could help sidestep the privacy issues inherent in working with real people’s DNA.

“Many biobanks, including the Estonian Biobank, require application procedures and ethical clearances for access. These steps are crucial because genomic data is sensitive data, and it’s important to keep the privacy of donors. On the other hand, this creates a scientific barrier,” Burak Yelmen, a geneticist at the University of Tartu in Estonia and lead author of the new paper, said in an email. “Artificial genomes might play an important role in the future as high-quality surrogates of real genome databases, making them easily accessible to researchers around the globe.”

Genetic data offers what is perhaps the largest ethical minefield in medical privacy, due to the power genes have in defining us. The research team used bits of accessible (read: legally obtainable) genetic information to train their networks, which were able to independently develop chunks of imaginary genome data that were nearly indistinguishable from actual genetic information. There were a few giveaways, Yelmen said, including the way that the artificial chunks of DNA were assembled. Different bits of genetic information were color-coded, or “painted,” to see their locations in the final product, and the team found more short chunks of artificial DNA were being produced than would be expected based on actual human genomic samples.


The team was unable to generate entire artificial genomes due to computational and algorithmic limitations, but they suggested “stitching” multiple chunks together to assemble a complete genome for one made-up individual.

“The training of the model is the bottleneck here. Once the model is trained, you can generate as many artificial genomes as you want in seconds,” Yelmen said. “Training of a 10,000-position genome chunk can vary dramatically depending on multiple factors.” With so many positions (each one a site in the genetic code where a particular nucleotide base pair occurs), Yelmen said the models can sometimes have a difficult time generating accurate results out of randomness.

A chromosome (genetic material) superimposed on binary code.

Image: Burak Yelmen

The deep learning involved in the research used two different approaches. One involved generative adversarial networks, which pit two neural networks against each other. The first (the “generator”) created candidate instances, in this case lines of genetic code generated from random noise. The second network was the “discriminator,” which assessed whether each candidate looked like real data. Its verdict was fed back into the generator for more accurate attempts down the line. The other approach was a restricted Boltzmann machine, which is a two-layered neural net that learns the structure of its training data over time, helping it produce better results going forward. Of the two, generative adversarial networks are the more widely used approach in deep learning.
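For readers who want a feel for the second approach, here is a minimal sketch of a restricted Boltzmann machine in plain Python. It is not the team's code. It assumes each genome position can be encoded as a 0 or a 1, uses two made-up eight-position training sequences, and trains with a standard shortcut called contrastive divergence. The class name `TinyRBM` and all of its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow on extreme values.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class TinyRBM:
    """Toy restricted Boltzmann machine over 0/1 values (hypothetical encoding)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = random.Random(seed)
        self.n_visible = n_visible
        self.n_hidden = n_hidden
        # Small random weights linking every visible unit to every hidden unit.
        self.w = [[self.rng.gauss(0.0, 0.1) for _ in range(n_hidden)]
                  for _ in range(n_visible)]
        self.vis_bias = [0.0] * n_visible
        self.hid_bias = [0.0] * n_hidden

    def hidden_probs(self, v):
        return [sigmoid(self.hid_bias[j] +
                        sum(v[i] * self.w[i][j] for i in range(self.n_visible)))
                for j in range(self.n_hidden)]

    def visible_probs(self, h):
        return [sigmoid(self.vis_bias[i] +
                        sum(h[j] * self.w[i][j] for j in range(self.n_hidden)))
                for i in range(self.n_visible)]

    def sample(self, probs):
        # Turn each probability into a 0 or a 1.
        return [1 if self.rng.random() < p else 0 for p in probs]

    def train_step(self, v0, lr=0.1):
        # One step of contrastive divergence (CD-1): compare the data with
        # the model's own one-step reconstruction and nudge the weights.
        ph0 = self.hidden_probs(v0)
        h0 = self.sample(ph0)
        v1 = self.sample(self.visible_probs(h0))
        ph1 = self.hidden_probs(v1)
        for i in range(self.n_visible):
            for j in range(self.n_hidden):
                self.w[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
            self.vis_bias[i] += lr * (v0[i] - v1[i])
        for j in range(self.n_hidden):
            self.hid_bias[j] += lr * (ph0[j] - ph1[j])

    def generate(self, n_steps=50):
        # Start from random noise and bounce between the two layers.
        v = self.sample([0.5] * self.n_visible)
        for _ in range(n_steps):
            h = self.sample(self.hidden_probs(v))
            v = self.sample(self.visible_probs(h))
        return v


# Two made-up training sequences; real chunks span thousands of positions.
toy_data = [[1, 1, 1, 1, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 1, 1, 1]]
rbm = TinyRBM(n_visible=8, n_hidden=4, seed=1)
for _ in range(200):
    for row in toy_data:
        rbm.train_step(row)

artificial = rbm.generate()
print(artificial)  # a list of eight 0/1 values
```

The generation step mirrors what the researchers describe: the model starts from random noise and settles toward sequences that resemble its training data. A toy this small says nothing about how well the real models perform.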

The team’s generative adversarial network took a couple of days to train entirely using one graphics processing unit, Yelmen added. GPUs are heavy-duty processors used for a variety of tasks, from detailed 3D rendering to deep learning.

“These genomes emerging from random noise mimic the complexities that we can observe within real human populations,” said co-author Luca Pagani, a geneticist also at the University of Tartu, in a release from the Estonian Research Council. “For most properties, they are not distinguishable from other genomes from the biobank we used to train our algorithm, except for one detail: they do not belong to any gene donor.”

Imagine a book that’s able to be continuously reorganized into a new, perfectly readable story, never revealing the original text. Facsimile genomes offer that possibility for future research, possibly without the worry of compromising any individual’s genetic code.