Preparing Real-World Radiation Oncology Data for Machine Learning

The University of Michigan is the U.S. epicenter for research on learning health systems. During a recent presentation there, Charles Mayo, Ph.D., director of radiation oncology informatics and analytics in U-M’s Department of Radiation Oncology, described the Michigan Radiation Oncology Analytics Resource (M-ROAR), which pulls together data from multiple sources to prepare for machine learning and analytics.

“We began about six years ago, asking what we were going to need to do to construct a system to be able to improve our ability to use the electronic healthcare data that we have,” Mayo said.

He added that in clinical practice and research areas, there are a number of different systems that are not consistent in how they represent data. “We needed a way to aggregate data into a single system to be able to integrate not just the data, but the concepts, so that we know how to tie this back to clinical practice,” Mayo said. “We gather data from multiple systems, other institutional databases, our treatment planning systems, and Epic into an application server. We clean up the data and form those integrations, then that data goes to an SQL Server that allows us to carry out operations like text searching, machine learning and analysis, and constructing dashboards.”
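To make the aggregation step Mayo describes more concrete, here is a minimal Python sketch. It is not M-ROAR's actual code or schema: the source extracts, field names, table layout, and the `build_database` and `search_notes` helpers are all hypothetical, and Python's built-in SQLite module stands in for the SQL Server mentioned in the quote. It shows two sources that label the same patient differently being harmonized into one database, followed by a simple text search across the joined records.

```python
import sqlite3

# Hypothetical extracts: each source names the same concepts differently.
PLANNING_ROWS = [
    {"mrn": "001", "rx_dose_cgy": 6000, "site": "Lung"},
    {"mrn": "002", "rx_dose_cgy": 4500, "site": "Breast"},
]
EHR_ROWS = [
    {"patient_id": "001", "note": "Grade 2 esophagitis reported"},
    {"patient_id": "002", "note": "No acute toxicity"},
]


def build_database():
    """Load both extracts into one in-memory database with shared keys and units."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE plan (patient_id TEXT, dose_gy REAL, site TEXT)")
    conn.execute("CREATE TABLE note (patient_id TEXT, text TEXT)")
    for row in PLANNING_ROWS:
        # Harmonize: 'mrn' becomes patient_id, cGy becomes Gy, site is lowercased.
        conn.execute(
            "INSERT INTO plan VALUES (?, ?, ?)",
            (row["mrn"], row["rx_dose_cgy"] / 100.0, row["site"].lower()),
        )
    for row in EHR_ROWS:
        conn.execute(
            "INSERT INTO note VALUES (?, ?)", (row["patient_id"], row["note"])
        )
    return conn


def search_notes(conn, term):
    """Return plan details for every patient whose note contains the term."""
    cursor = conn.execute(
        "SELECT p.patient_id, p.site, p.dose_gy, n.text "
        "FROM plan p JOIN note n ON n.patient_id = p.patient_id "
        "WHERE n.text LIKE ? ORDER BY p.patient_id",
        (f"%{term}%",),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_database()
    for match in search_notes(db, "esophagitis"):
        print(match)
```

The point of the sketch is that the harmonization (shared identifier, shared units) happens once at load time, so later queries and dashboards do not need to know which system a value came from.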

The M-ROAR project has more than 47,000 patients in the database and millions of longitudinal records. “It’s not just about radiation oncology. We understand patient-reported outcomes, but we also understand charges, diagnoses, toxicities, notes, encounters, and medications,” Mayo said.

In terms of the data pipeline, they started out looking at local registries. Those ultimately feed into what happens in institutional registries, he said. “We then look to colleagues at other institutions for the potential of multi-institutional registries. Most of those efforts really grow out of research,” he added, “but increasingly, as we deal with researchers from our own and other institutions, we see where this can guide healthcare policy with better, more comprehensive data.”

However, many of these data sets tend not to be artificial intelligence-ready, Mayo said. “Just having the data in a bucket doesn't make it ready for the applications that are using artificial intelligence. To use it, a lot of effort goes into cleaning that data and organizing it so it can make sense. We are presently working with many investigators carrying on these efforts manually, and understanding how AI can come in with big data to be useful. Ultimately, this should be pointing toward a future where all this is much more automated and standardized.”

“When we looked at the barriers, and the models that are used for how to gather the data, we realized that the common term of data mining actually misses the point,” Mayo said. “What we're needing to do is build a data culture. Being able to construct a learning health system is not just the technology; it's also recognizing that we need to learn to plant in straight rows. How are we going to lay out data so that we can start to bring in machinery to be able to use that?” He said they are working on reducing the variability in the data and increasing the veracity. They are starting to be able to get a high volume of data that can be gathered very quickly.

In a paper on the topic, Mayo and colleagues said that data farming is perhaps a more accurate term for what they are doing than data mining. “The objective is to harvest large volumes of data that we could use as raw materials for analyzing healthcare patterns and outcomes. Like the farmer who considers the implication of every part of the sowing, growing, and harvesting process on the yield of high-quality grain, we need to examine how best to use the tools available in our electronic systems to increase the volume of actionable data that are readily available. High-quality data sources rarely exist independent of our efforts, just waiting to be found, or mined. They result from intent and dedication of resources to grow these data sources and curate (weed out) misleading information.”

Mayo said they are increasingly aware of the ethical challenges involved in doing this work. “Not only do we gather clinical information, demographic information about patients, increasingly, the possibility of adding genomic information comes into this,” he said. “As stewards of patient data, we want to make advances, but we need to maintain trust, and think about the risks or where the data may go in ways that we didn't anticipate and how it may be modeled in a way which we didn't anticipate, so it makes our responsibilities broader.”

As they motivate people to use these systems and embrace standardization, Mayo said, they also commit to showing collaborators the value of participating. For instance, he said, they have dashboards that pinpoint where patients are coming from and what kinds of diseases they have, chart the longitudinal progress of lab values, and show which patients are treated on which machines.

Another use case involves helping with expansion planning. “We had a site that was looking to add another linear accelerator, which we use for treating patients, and there were questions about the catchment areas — where the patients are coming from. With this system, it becomes very straightforward to start to answer these questions, so it's not all about research; it's also about clinical practice,” Mayo explained.

In addition to working with clinicians to standardize notes in clinics to find similar data, Mayo said they are working with researchers at other institutions to standardize nomenclature. That will make it easier to find data and in the long run to centralize data. He said there are several efforts under way, including HL7’s mCODE, to design better communication between systems, so that researchers can extract treatment summary information in a uniform fashion.
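One way to picture nomenclature standardization is as a lookup that maps each clinic's local label onto a single agreed name, and flags anything it cannot map for human review. The Python sketch below assumes exactly that; the alias table, the target names, and the `standardize` and `standardize_all` helpers are illustrative inventions, not the vocabulary of mCODE or of any effort Mayo described.

```python
# Hypothetical alias table: local labels (normalized) mapped to one agreed name.
ALIASES = {
    "lt lung": "Lung_L",
    "left lung": "Lung_L",
    "l lung": "Lung_L",
    "cord": "SpinalCord",
    "spinal cord": "SpinalCord",
}


def normalize(label):
    """Lowercase, treat underscores as spaces, and collapse repeated whitespace."""
    return " ".join(label.replace("_", " ").lower().split())


def standardize(label):
    """Return the agreed name for a local label, or None if it is unknown."""
    return ALIASES.get(normalize(label))


def standardize_all(labels):
    """Split labels into a mapped dictionary and a list needing manual review."""
    mapped, unmapped = {}, []
    for label in labels:
        standard = standardize(label)
        if standard is None:
            unmapped.append(label)
        else:
            mapped[label] = standard
    return mapped, unmapped


if __name__ == "__main__":
    mapped, unmapped = standardize_all(["Left_Lung", "  CORD ", "PTV_boost"])
    print(mapped)
    print(unmapped)
```

Returning the unmapped labels separately, rather than guessing, reflects the manual curation Mayo says investigators still carry out before this kind of work can be automated.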

Mayo contrasted traditional clinical trials with emerging trials using real-world data gathered from routine practice. “From our standpoint, we don't really see it as an either/or approach,” he said. “We see what you can learn from routine clinical EHR data as constructing a pathway for better-formed, data-driven hypotheses that then later come into clinical trials, and for forming a pathway to go back and carry out validation studies.”