NIH Initiative to Build Data Sets for Developing AI Technologies

The National Institutes of Health has announced it will invest $130 million over four years to accelerate the widespread use of artificial intelligence (AI) by the biomedical and behavioral research communities.

The NIH Common Fund’s Bridge to Artificial Intelligence (Bridge2AI) program has issued four awards for data generation projects, and three to create a Bridge Center for integration, dissemination and evaluation activities. The data generation projects will generate new biomedical and behavioral data sets ready to be used for developing AI technologies, along with creating data standards and tools for ensuring data are findable, accessible, interoperable, and reusable, a set of principles known as FAIR, NIH said.

In addition, data generation projects will develop training materials that promote a culture of diversity and the use of ethical practices throughout the data generation process. The Bridge Center will be responsible for integrating activities and knowledge across data generation projects, and disseminating products, best practices, and training materials.

Bridge2AI will produce a variety of data types ready to be used by the research community for AI analyses. These types include voice and other data to help identify abnormal changes in the body. Researchers will also generate data that can be used to make new connections between complex genetic pathways and changes in cell shape or function to better understand how they work together to influence health. In addition, AI-ready data will be prepared to help improve decision making in critical care settings to speed recovery from acute illnesses and to help uncover the complex biological processes underlying an individual’s recovery from illness.

Participating research institutions have shared some details about what they will be working on. One of the four data generation projects, Voice as a Biomarker, co-led by the University of South Florida and Weill Cornell Medicine, will bring together medical, voice, AI, engineering, and ethics experts to create a human voice database using privacy-preserving AI, giving doctors a new tool for diagnosing conditions known to have associations with voice alterations.

Drawing on the existing literature and ongoing research, the research team has identified five disease cohort categories in which voice changes have been linked to specific diseases with well-recognized unmet needs. Data collection for this project will center on the following categories:

• Voice disorders (laryngeal cancers, vocal fold paralysis, benign laryngeal lesions)

• Neurological and neurodegenerative disorders (Alzheimer’s, Parkinson’s, stroke, ALS)

• Mood and psychiatric disorders (depression, schizophrenia, bipolar disorders)

• Respiratory disorders (pneumonia, COPD)

• Pediatric voice and speech disorders (speech and language delays, autism)

Federated learning technology – an AI framework that allows machine learning models to be trained on data without the data ever leaving its source – will be deployed across multiple research centers by French-American biotech startup Owkin to demonstrate that cross-center AI research can be conducted while preserving the privacy and security of sensitive voice data.
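For readers unfamiliar with the idea, the general pattern behind federated learning can be sketched in a few lines of Python. This is only an illustration of the widely used federated averaging approach, not Owkin's or the project's actual system, which the announcement does not describe. The three "sites", their synthetic data, and the function names (`local_update`, `federated_average`, `make_site_data`) are all hypothetical. Each site fits a simple linear model on its own records and shares only the resulting model parameters, never the records themselves.

```python
import random

def local_update(weights, data, lr=0.1, epochs=20):
    """Run a few gradient-descent steps on one site's private data.

    Only the updated (slope, intercept) pair is returned; the raw
    records in `data` never leave this function.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(updates, sizes):
    """Combine site models, weighting each by its number of records."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def make_site_data(n, seed):
    """Synthetic stand-in for one center's data: y = 3x + 1 plus noise."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        records.append((x, 3.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return records

# Three hypothetical research centers with different amounts of data.
sites = [make_site_data(n, seed) for n, seed in [(40, 1), (60, 2), (100, 3)]]

# The coordinator starts from a blank model and repeats:
# send model out, collect locally trained models, average them.
global_weights = (0.0, 0.0)
for _ in range(30):
    updates = [local_update(global_weights, d) for d in sites]
    global_weights = federated_average(updates, [len(d) for d in sites])

print("slope, intercept:", global_weights)
```

In this toy setup the averaged model should end up close to the slope and intercept used to generate the data, even though no site ever sees another site's records. Real deployments typically add further protections, such as encryption of the shared updates, which this sketch leaves out.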

Shannon McWeeney, Ph.D., chief data officer for the Oregon Health & Science University (OHSU) Knight Cancer Institute, will co-lead tool development to support one of the “grand challenge” projects in the Bridge2AI program, aimed at generating new biomedical and behavioral data sets that are ethically sourced, trustworthy, well-defined and accessible. The collaborative team will leverage $7.8 million in funding for the first year to develop software and standards to unify data attributes across multiple data sources and types.

“Moving the field of AI forward is essential to help detect and treat earlier diseases like cardiovascular disease, diabetes and cancer,” says McWeeney, a professor and head of the Division of Bioinformatics and Computational Biology in the OHSU School of Medicine, in a statement. “The ability to understand and affect the course of complex, multi-system diseases has been limited by a lack of well-designed, high-quality, large, and inclusive multimodal datasets. We need transparency about how the data are generated with regard to any bias or uncertainties, and to ensure they are ethically sourced. We also need to lower the barrier for researchers to be able to use AI-based tools in their future research.”

Other institutions collaborating on the Data Generation project include: University of Washington, California Medical Innovations Institute, Johns Hopkins University, University of California at San Diego, University of Pennsylvania, Stanford University, Native BioData Consortium, University of Alabama at Birmingham, University of Mississippi Medical Center, Henry Ford Health System and Microsoft.

David Dorr, M.D., M.S., chief research information officer and professor of medical informatics and clinical epidemiology in the OHSU School of Medicine, will co-lead another of the grand challenge projects, “Skills and Workforce Development,” with a team from Washington University in St. Louis.

This module will be centered on bridging expertise across people in the biomedical and behavioral research domains to develop an AI/machine-learning research workforce. Dorr says this project is designed to enhance skill development and attract and develop a specialized workforce.

Researchers at University of California San Diego and University of California San Francisco are expected to receive nearly $20 million in the next four years to launch Cell Maps for AI, a research project designed to usher in a new era of precision medicine. The team envisions a future in which an AI algorithm could analyze a patient's genome and decipher which disease they have, what stage they are in and which treatments are most likely to help. Importantly, they say the algorithm must be interpretable, such that a physician could point to the molecular and cellular pathways that inform its decisions.

"It's not enough for an algorithm to just take a complex set of mutations and decide what drug to give a patient if we don't know why it's making that choice," said Trey Ideker, professor at UC San Diego School of Medicine, in a statement. "We may now have enough human genomes sequenced to power precision medicine, but what we don't have yet is a clear map of cellular biology to interpret the data with."

To address this, the project aims to map the structure and function of a human cell in its entirety, starting with the most basic cell type: the stem cell. The researchers will obtain induced pluripotent stem cells from a variety of genetic backgrounds and combine microscopy, biochemistry and computational tools to study their biology at multiple scales. The final product will be a comprehensive model of the cell, from genes and proteins to entire organelles and how they all work together. Once the stem cell has been modeled, they plan to use the same approach to model other cells, such as those that are dividing, differentiating or in various disease states.

Their goal is to eventually have a library of cell maps across many demographic and disease contexts, which can be used to train AI algorithms to make informed and interpretable decisions about human health.