Natália Ružičková Institute of Science and Technology Austria
Many statistical models and algorithms that scientists use can be imagined as a “black box.” These powerful models give accurate predictions, but their internal workings are not easily understood. In an era dominated by deep learning, where an ever-increasing amount of data can be processed, Natália Ružičková, a physicist and PhD student at the Institute of Science and Technology Austria (ISTA), chose to take a step back, at least in the context of genomic data analysis.
Ružičková, along with recent ISTA graduate Michal Hledík and Professor Gašper Tkačik, has proposed a model to analyze polygenic diseases—conditions where multiple regions of the genome contribute to dysfunction. This model also aids in understanding the role of these identified genomic regions in developing these diseases. Their research provides valuable findings by integrating advanced genome analysis with fundamental biological insights. The results have been published in the Proceedings of the National Academy of Sciences (PNAS).
Decoding the human genome
In 1990, the Human Genome Project was launched to fully decode human DNA, the genetic blueprint that defines humanity. By 2003, the project was completed, leading to numerous scientific, medical, and technological breakthroughs. By deciphering the human genetic code, scientists aimed to learn more about diseases linked to specific mutations and variations in this genetic map. The human genome comprises approximately 20,000 genes and roughly three billion base pairs, which are the letters of the blueprint. Analyzing data at this scale required considerable statistical power, which led to the development of “genome-wide association studies” (GWAS).
GWAS approach the issue by identifying genetic variants potentially linked to organismal traits such as height. Notably, these traits also include the propensity for various diseases. The underlying statistical principle is relatively straightforward: participants are divided into two groups, healthy and sick individuals. Their DNA is then analyzed to detect variations (changes in the genome) that are more common in those affected by the disease.
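The case-control comparison described above can be illustrated with a small sketch. It is not the procedure of any particular study: the function name `allele_association`, the variant names, and all counts are hypothetical, and a plain chi-square test on a 2x2 table of allele counts stands in for the more elaborate statistics that real GWAS pipelines use.

```python
from math import erfc, sqrt


def allele_association(case_alt, case_ref, control_alt, control_ref):
    """Chi-square test (1 degree of freedom) on a 2x2 table of allele counts.

    Returns (chi2, p_value). A small p-value suggests the variant allele is
    unevenly distributed between sick (case) and healthy (control) groups.
    """
    n = case_alt + case_ref + control_alt + control_ref
    row_case = case_alt + case_ref
    row_control = control_alt + control_ref
    col_alt = case_alt + control_alt
    col_ref = case_ref + control_ref
    # A degenerate table (an empty row or column) carries no evidence.
    if 0 in (row_case, row_control, col_alt, col_ref):
        return 0.0, 1.0
    diff = case_alt * control_ref - case_ref * control_alt
    chi2 = n * diff ** 2 / (row_case * row_control * col_alt * col_ref)
    # Survival function of the chi-square distribution with 1 df.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref).
variants = {
    "variant_1": (60, 40, 40, 60),  # allele more common among cases
    "variant_2": (50, 50, 50, 50),  # no difference between groups
}

for name, counts in variants.items():
    chi2, p = allele_association(*counts)
    print(f"{name}: chi2 = {chi2:.2f}, p = {p:.4f}")
```

In a genome-wide setting, a test of this kind is repeated for a very large number of variants, which is one reason large cohorts and strict significance thresholds are needed.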
An interplay of genes
When genome-wide association studies emerged, scientists expected to find just a few mutations in known genes linked to a disease that would explain the difference between healthy and sick individuals. The truth, however, is much more complicated. “Sometimes, hundreds or thousands of mutations are linked to a specific disease,” says Ružičková. “It was a surprising revelation and conflicted with our understanding of biology.”
Each individual mutation contributes only minimally to the risk of developing a disease. However, when combined, these mutations can provide a better—though not complete—understanding of why some individuals develop the disease. Such diseases are known as “polygenic.” For instance, type 2 diabetes is considered polygenic because it cannot be attributed to a single gene; rather, it involves hundreds of mutations. Some of these mutations influence insulin production, insulin action, or glucose metabolism, while many others are found in genomic regions that have not been previously linked to diabetes or have unknown biological functions.
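How many small contributions add up can be shown with a toy additive score. This is a common simplification and not something taken from the study discussed here: the function name `polygenic_score`, the variant names, and the effect sizes are all made up for illustration.

```python
def polygenic_score(genotype, effect_sizes):
    """Sum of per-variant effects, weighted by the number of risk alleles.

    genotype maps variant name -> allele count (0, 1, or 2);
    effect_sizes maps variant name -> hypothetical per-allele effect.
    Variants missing from the genotype count as zero risk alleles.
    """
    return sum(effect * genotype.get(variant, 0)
               for variant, effect in effect_sizes.items())


# Hypothetical effect sizes: each one is small on its own.
effect_sizes = {"v1": 0.02, "v2": -0.01, "v3": 0.03}

# Hypothetical individual carrying two copies of v1 and one copy of v2.
genotype = {"v1": 2, "v2": 1, "v3": 0}

print(f"score = {polygenic_score(genotype, effect_sizes):.3f}")
```

With hundreds or thousands of variants, the individual terms stay tiny while the sum can differ noticeably between people, which matches the "better, though not complete" picture described above.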
The omnigenic model
In 2017, Evan A. Boyle and colleagues from Stanford University proposed a new conceptual framework called the “omnigenic model.” It offers an explanation for why so many genes contribute to diseases: cells possess regulatory networks that link genes with diverse functions.
“Since genes are interconnected, a mutation in one gene can impact others, as the mutational effect spreads through the regulatory network,” Ružičková explains. Due to these networks, many genes in the regulatory system contribute to a disease. However, until now, this model had not been formulated mathematically and remained a conceptual hypothesis that was difficult to test. In their latest paper, Ružičková and her colleagues introduce a mathematical formalization of the omnigenic model, named the “quantitative omnigenic model” (QOM).
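The idea that a mutational effect spreads through a regulatory network can be sketched with a toy linear model. This is only an illustration of the general principle and is not the QOM from the paper: the gene names, the regulatory weights, and the function `propagate` are hypothetical, and the assumption that effects add up linearly is ours.

```python
def propagate(direct_effect, weights, steps=50):
    """Spread direct mutational effects through a toy regulatory network.

    direct_effect maps gene -> direct effect of a mutation on that gene.
    weights[target][source] is the assumed influence of source on target.
    Repeatedly applies: total = direct + (weighted sum of regulators' totals).
    For networks without feedback loops this settles after a few steps.
    """
    genes = list(direct_effect)
    total = dict(direct_effect)
    for _ in range(steps):
        total = {
            gene: direct_effect[gene]
            + sum(w * total[source]
                  for source, w in weights.get(gene, {}).items())
            for gene in genes
        }
    return total


# Hypothetical chain: gene_a regulates gene_b, which regulates gene_c.
weights = {
    "gene_b": {"gene_a": 0.5},
    "gene_c": {"gene_b": 0.4},
}

# A mutation that directly affects only gene_a.
direct_effect = {"gene_a": 1.0, "gene_b": 0.0, "gene_c": 0.0}

for gene, effect in propagate(direct_effect, weights).items():
    print(f"{gene}: total effect = {effect:.2f}")
```

Even though the mutation touches only `gene_a` directly, the downstream genes shift as well, which is the intuition behind many genes appearing to contribute to a single trait.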
Combining statistics and biology
To demonstrate the new model’s potential, they needed to apply the framework to a well-characterized biological system. They chose the common laboratory model organism Saccharomyces cerevisiae, better known as brewer’s or baker’s yeast. It is a single-celled eukaryote, meaning its cell structure is similar to that of complex organisms such as humans. “In yeast, we have a fairly good understanding of how regulatory networks that interconnect genes are structured,” Ružičková says.
Using their model, the scientists predicted gene expression levels (the intensity of gene activity, indicating how much information from the DNA is actively utilized) and how mutations spread through the yeast’s regulatory network. The predictions worked well: the model identified the relevant genes and could clearly pinpoint which mutation most likely contributed to a specific outcome.
The puzzle pieces of polygenic diseases
The scientists’ goal was not to outdo the standard GWAS in prediction performance but rather to go in a different direction by making the model interpretable. Whereas a standard GWAS model works as a “black box,” offering a statistical account of how frequently a particular mutation is linked to a disease, the new model also provides a chain-of-events causal mechanism for how that mutation may lead to disease.
In medicine, understanding the biological context and such causal pathways has major implications for finding new therapeutic options. Although the model is far from any medical application, it shows potential, especially for learning more about polygenic diseases. “If you have enough knowledge about the regulatory networks, you could also build similar models for other organisms. We looked at the gene expression in yeast, which is just the first step and proof of principle. Now that we understand what is possible, one can start thinking about applications to human genetics,” says Ružičková.