Washington University Informatics Researchers Describe Value of Synthetic Data

Researchers from the Institute for Informatics at Washington University School of Medicine in St. Louis say their use of “synthetic data” has created new possibilities for research because they don’t need to get approval from the institutional review board (IRB) for each new study.

During a recent webinar presentation, Randi Foraker, Ph.D., director of the Center for Population Health Informatics at the Institute for Informatics and an associate professor of general medical sciences at Washington University School of Medicine, described her organization’s use of a research data lake using tools from an Israel-based company called MDClone.

MDClone creates a synthetic copy of healthcare data collected from actual patient populations. While the synthetic data set is virtually identical to the original data in its statistical properties, it contains no identifying information that can be traced back to individual patients. Unlike de-identified data, which is produced by stripping certain data elements from real records, a synthetic data set is a new population generated from the original data: it is designed to mirror the statistical characteristics of the original population but is populated with novel synthetic patients. MDClone says synthetic data, accessed through its self-service platform, can help reduce cycles of discovery from months and years to hours and days.
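To make the idea concrete, here is a minimal sketch of one simple way to build a synthetic population that mirrors the statistics of an original one. It is not MDClone's method, which the article does not describe. It assumes just two numeric fields (a hypothetical age and systolic blood pressure), fits a bivariate normal distribution to them, and samples new "patients" from that fit. The "original" cohort here is itself simulated, and all function and variable names are illustrative.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def synthesize(ages, pressures, n, rng):
    """Draw n synthetic (age, pressure) pairs from a bivariate normal
    fitted to the original columns. No original row is copied."""
    mu_a, sd_a = statistics.fmean(ages), statistics.stdev(ages)
    mu_p, sd_p = statistics.fmean(pressures), statistics.stdev(pressures)
    r = pearson(ages, pressures)
    rows = []
    for _ in range(n):
        # Two independent standard normals, mixed so the pair has correlation r.
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        age = mu_a + sd_a * z1
        pressure = mu_p + sd_p * (r * z1 + math.sqrt(1 - r * r) * z2)
        rows.append((age, pressure))
    return rows


rng = random.Random(0)

# Hypothetical "original" cohort (simulated here, not real patient data).
orig_ages = [rng.gauss(60, 12) for _ in range(2000)]
orig_bp = [100 + 0.5 * a + rng.gauss(0, 10) for a in orig_ages]

synthetic = synthesize(orig_ages, orig_bp, 2000, rng)
syn_ages = [a for a, _ in synthetic]
syn_bp = [p for _, p in synthetic]

print("mean age:    original %.1f, synthetic %.1f"
      % (statistics.fmean(orig_ages), statistics.fmean(syn_ages)))
print("correlation: original %.2f, synthetic %.2f"
      % (pearson(orig_ages, orig_bp), pearson(syn_ages, syn_bp)))
```

The synthetic rows should reproduce the means, spreads, and correlation of the original columns without corresponding to any original record. A production system would have to handle many more variables, categorical and missing values, and rare combinations that could still identify someone, none of which this sketch attempts.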

“We have an MDClone instance at Washington University in St. Louis,” Foraker said. “We populate the data lake with clinical data collected at the point of care, and those comprise what we call the Health Data Core (HDC). Then in the Institute for Informatics, we collect those data from the HDC and add some research data from other sources. We overlay on these data some self-service tools, and this portfolio constitutes the Research Data Core (RDC). These are the data that are loaded into the MDClone healthcare data lake. Then the MDClone query tool is used to generate that computationally derived or synthetic data that's based upon that original cohort that you described in the query tool.”

Intermountain Healthcare in Utah has used the platform to build a program for managing its chronic kidney disease population. Foraker explained how Washington University School of Medicine has developed a machine-learning model for predicting sepsis. Researchers there have also used the synthetic data to evaluate different machine learning and deep learning approaches to identify and risk-stratify heart failure patients.

“Researchers can log into the query tool and download the data that they want, on the cohort that they specify, and conduct an analysis. There isn't the regulatory hurdle of getting institutional review board approval, because the computationally derived data no longer constitute human subjects data.”

For example, she explained, researchers can look at the synthetic data in their MDClone data lake to see how many cases of sepsis the health system had in a one-year period or how many diagnoses of breast cancer over a five-year period. “A user can do that very quickly,” Foraker said. “They can also download the data and write up an abstract for submission to a scientific meeting.”

Foraker and other researchers have used the synthetic data exclusively to produce publications. “When I say exclusively, I mean that we present the results from the computationally derived data, and we don't do statistical validation with the original data, because we've conducted a sufficient number of use cases to demonstrate, at least for some analyses, that the relationships and the correlations and the results should be the same between the computationally derived data and the original data.”

Spinal surgery research

During the webinar, Jacob Greenberg, M.D., a postdoctoral research fellow and resident physician in the department of neurosurgery at Washington University School of Medicine, spoke about using the synthetic data in a study involving spine surgery patients. He used it to study 30-day readmissions, a commonly studied clinical outcome, specifically after anterior cervical fusion, which is one of the most common spine surgeries performed.

“When using synthetic data, it's important to think about what you actually want out of your research platform, depending on what type of question you're asking,” he said. “So for the types of questions that I'm typically interested in, it's important to be able to integrate multimodal data — certainly clinical diagnoses and labs, imaging and patient-reported outcomes are often important. It's important to integrate information across health settings — inpatient and outpatient data. And although spine surgery is common, when you're trying to study certain particular outcomes, like readmission rates or certain types of complications, those can be relatively rare, so if you're trying to look for risk factors, it's important to generate sufficient sample sizes.”

So why does Greenberg think synthetic data is important for this type of research? “One important reason is the potential for multicenter collaborative efforts,” he said. There are existing platforms for multicenter collaborations now, but when researchers try to share raw protected health information across centers, issues related to data ownership and privacy often come up. Simplifying that process really streamlines the collaborative effort, he said. “There are many instances where we're trying to conduct the initial analyses and the synthetic data platform can streamline that,” he said. “As a first case study, we looked at the use of synthetic data to study 30-day readmissions, which is a common clinical outcome that we and others look at, specifically after anterior cervical fusion, which is one of the most common spine surgeries that we perform.”

“We've found that synthetic data closely mimics the descriptive profiles of real data, and also performed similarly in predicting 30-day readmission,” Greenberg said. “Our goal going forward is to delineate the best use cases for synthetic data in spine surgery.”

Use by National COVID Cohort Collaborative

Foraker also mentioned that the National COVID Cohort Collaborative (N3C) uses an instance of MDClone. “I'm involved with the synthetic data workstream with that group, and we have designed and carried out a distinct set of three use cases in the National COVID Cohort Collaborative using COVID-19 data from over 30 different institutions. So I think that there are some really interesting dimensions here because the institutions have agreed to share their data, and they're putting together a limited data set from all of those institutions. Those data are also loaded into the MDClone data lake for access by the N3C investigators.”

Foraker also spoke about how the synthetic data can be valuable for educational purposes. “I am in the midst of teaching a biomedical informatics course this semester. And one of the modules or projects that we have assigned the students is to get on-boarded into MDClone and trained on how to use the query tool,” she said. “They can then download synthetic data, create an abstract that they could submit to a conference following conference abstract criteria, and then put together a poster of the results.”

“What's really nice is that the students can get access to it right away,” she said. “They can complete this project in the eight weeks that they have left in the semester. And they're not going to be slowed down by regulatory processes, or any sort of data brokerage that would need to happen when using the original data. Why I think it is so valuable for students like ours to have access to data like this is because the electronic health record data aren't perfect. You can't answer every question with those data. They're also messy and they're not complete. There are issues of missing data and incomplete data values. And I think it's really critical for students to have access to this real-world data, to give them data management skills, and to show them how you need to make decisions about data, even before you start analyzing the data. I think that these types of data have a lot of promise to us academically and in our education programs. And it can serve as a really useful tool for educating students and trainees on real-world data.”

Last July, MDClone announced the creation of a Global Network of health systems that will use the platform, installed across the network's sites, to develop solutions and explore ideas together to improve patient health. Among the health systems involved are Intermountain Healthcare, Jefferson Health, Washington University in St. Louis, Regenstrief Institute, Jewish General Hospital and the Ottawa Hospital in Canada, and Sheba Medical Center in Israel.