Dana Farber Researchers Address Oncology Data-Sharing Issues

Dana Farber Cancer Institute (DFCI) researchers in Boston are seeking to address data-sharing challenges associated with oncology research by working with a company called Duality on new approaches to preserving the privacy of sensitive data. Alexander Gusev, Ph.D., associate professor of medicine at Dana Farber and Harvard Medical School, and Adi Hirschstein, vice president of product at Duality, recently spoke with Healthcare Innovation about the new approach they are taking.

Healthcare Innovation: First, could you describe some of the work your lab at Dana Farber is doing?

Gusev: I’m a statistical geneticist by training. So even though I'm in a medical school, my background is not clinical, it's statistical. I am very interested in using computational, algorithmic and now artificial intelligence approaches to questions in oncology. We have a lot of interest in understanding the patient-level response and processes in the context of immunotherapy. In a lot of advanced disease that used to be basically hopeless, now you can activate the patient's immune system, and in some cases, lead to full cures for those patients. But some patients don't respond at all, and some patients actually get worse than they would on a conventional treatment when they go on immunotherapy. A pretty sizable portion of patients develop toxicities, so they actually have an overreaction to the treatment that ends up being worse than if they hadn't been treated at all.

There are a lot of decision points around that, so we've been trying to integrate lots of data sources. One source of data is genetic, and we had a study a couple years ago in Nature Medicine where we showed that inherited genetic variation is very strongly associated with developing these immune-related adverse events. In some cancers, that would indicate that you shouldn't go on immunotherapy if you are a carrier of that polymorphism, because the toxicity isn't worth it, and there are other treatments that are more effective for you.

HCI: What are some of the issues with data sharing or access to data that you deal with and that this partnership with Duality might help address?

Gusev: The toxicities analysis was done entirely with genetic data. We are also interested in using information from digital images. We want to use the genetic features that we already know are associated with toxicities, and combine them with this image data to identify the actual cellular populations and changes that are predictive of outcomes as well.

There's a global challenge in academia of patient and clinical data sharing, because it is almost always coming from sensitive patient groups, some of whom have not consented to have their data shared. But even if they have consented to data sharing, they are still always concerned about de-identification. With genetic data, that is a bit less of a concern in the sense that there are ways to de-identify genetic data for sharing, and that's something that the NIH and other organizations have been working to set up protocols for. For imaging data, all bets are off because it's essentially unstructured. The other thing that I actually didn't realize is that a lot of times oncologists will write information on the slide. So they'll write the patient's Social Security number or their medical record number, or their name, or even just their own name, which is still identifiable. So there's actually a lot of identifying info just written in Sharpie on these slides, and that presents an extreme challenge for de-identification, because it either can't be done, because you would be removing parts of the actual image, or it is extremely labor-intensive to go through manually and identify these various issues.

The digital slides we have been analyzing internally are where we realized that there is all of this identifying information all over them, and those are exactly the kind of data sets that we would like to work with across institutions. The cross-institutional use of these slides is, I think, even more important than it is for genetic data, because every institution has its own slightly different way of digitizing or cutting the slides, of compressing them on its own. Some of these subtle patterns, like what they write on the slide, will sometimes be an indication of how severe the patient's condition is. So cross-institutional validation is really important.

HCI: Adi, could you describe Duality’s work in this space?

Hirschstein: Duality was founded a few years ago with a very clear vision to help multiple organizations collaborate on sensitive data. Duality works in industries where sharing the data is challenging on one hand, but very beneficial on the other. There are many cases in financial industries and in government and obviously in healthcare, where you take multiple organizations and you provide them with the ability to run computation, whether it's machine learning, whether it's queries or statistical computation, across organizations, so they gain new insights in a way they couldn't before. The challenge is, obviously, that the data is sensitive, so how can you run a computation on top of data that you cannot access? In order to do that, Duality came up with a platform that has different types of technologies. Our product vision is to use best of breed in terms of the privacy technology that we're using. So we started with a specific technology called homomorphic encryption, which basically provides you the ability to take encrypted data and run computations on it without decrypting it.
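[To illustrate the idea of computing on encrypted data, here is a minimal sketch in Python. It is not Duality's implementation, and the interview does not say which encryption scheme the platform uses. The sketch assumes the textbook Paillier scheme, which supports only addition on ciphertexts, and uses toy-sized primes that would be insecure in practice. The function names and the patient counts are hypothetical.]

```python
import math
import secrets

# Toy Paillier parameters (an assumption for illustration only).
# These primes are far too small for real-world security.
P, Q = 104_729, 1_299_709
N = P * Q                      # public key
N_SQ = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private key: lcm(P-1, Q-1)
MU = pow(LAM, -1, N)           # modular inverse (Python 3.8+); valid for generator N + 1


def encrypt(message: int) -> int:
    """Encrypt an integer 0 <= message < N using only the public key N."""
    while True:
        r = secrets.randbelow(N - 1) + 1   # fresh randomness for every ciphertext
        if math.gcd(r, N) == 1:
            break
    return (pow(N + 1, message, N_SQ) * pow(r, N, N_SQ)) % N_SQ


def decrypt(ciphertext: int) -> int:
    """Recover the plaintext using the private values LAM and MU."""
    return ((pow(ciphertext, LAM, N_SQ) - 1) // N * MU) % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying two ciphertexts yields an encryption of the plaintext sum."""
    return (c1 * c2) % N_SQ


# Hypothetical example: two hospitals each encrypt a local patient count.
count_site_a = encrypt(17)
count_site_b = encrypt(25)

# A third party can combine the ciphertexts without seeing either count.
encrypted_total = add_encrypted(count_site_a, count_site_b)

print(decrypt(encrypted_total))  # 42
```

[In this sketch, whoever performs the addition sees only ciphertexts; only the holder of the private key can read the combined total. Schemes that also support multiplication on encrypted values are considerably more involved.]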

Over time, we added other technologies such as federated learning. With federated learning, you can actually train the model locally. The data never leaves. So by definition, the data is fully protected. But even so, when you run federated learning, you need to aggregate the intermediate results across the multiple institutions, and those could reveal some information. And in order to fully protect that flow, we are adding another technology called a Trusted Execution Environment, which is basically a hardware-based technology to protect your data. This type of technology is being offered as a service in the cloud and is directly integrated with the platform. So in some cases, we're actually running use cases with multiple privacy-enhancing technologies in order to best protect the data.
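[The federated pattern described here, local training followed by aggregation of intermediate results, can be sketched as follows. This is a simplified illustration of federated averaging on a one-variable linear model, not the platform's actual method. The site data, function names, learning rate and round count are all assumptions made for the example.]

```python
def local_update(weights, data, lr=0.05, epochs=5):
    """Run a few gradient-descent steps on one site's own data.

    Only the resulting (w, b) pair leaves the site, never the raw records.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Combine site updates, weighting each by its number of records."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def train(sites, rounds=500):
    """Alternate between local training at each site and central averaging."""
    weights = (0.0, 0.0)
    for _ in range(rounds):
        updates = [local_update(weights, data) for data in sites]
        weights = federated_average(updates, [len(d) for d in sites])
    return weights


# Hypothetical data held at two sites, both following y = 2x + 1.
site_a = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0)]
site_b = [(x, 2.0 * x + 1.0) for x in (3.0, 4.0)]

w, b = train([site_a, site_b])
print(round(w, 3), round(b, 3))
```

[The values passed to `federated_average` are the intermediate results Hirschstein refers to. They are derived from each site's data, which is why they could reveal some information and why an additional protective layer is applied to the aggregation step.]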

[In a paper published in the Proceedings of the National Academy of Sciences, Gusev and other researchers explained how using a federated model allows multiple institutions with their own clinical and genomic data to perform secure joint analyses across all patients without decrypting the underlying individual-level values. In a statement, Ravit Geva, M.D., deputy director of the Oncology Division and head of the Clinical Research & Innovation unit of the oncology division at Tel Aviv Sourasky Medical Center, said, “Our joint study with Duality aimed and verified the accuracy of statistical oncology endpoints when done through encrypted data. The secure analysis yields accurate results compared with the currently used conventional data management and analysis methods on Collaborative Real-world Oncological analyses without revealing patients' protected health information.”]

HCI: I’ve written about federated data models like PCORnet, where, as I understand it, the research question goes out to the sites, rather than creating a central data warehouse to run queries on. Is that a similar approach?

Hirschstein: Yes. And on top of the privacy challenge, there is also an operational challenge. Even if you could take the data and put it in a centralized place, every image is around one gigabyte. And if you have tens of thousands of gigabytes across multiple centers, that ends up with a pretty big amount of data. And moving around this data is not practical on an ongoing basis.

HCI: So Prof. Gusev, do you have to reach out to other medical centers that you want to share data with and explain this concept and get them comfortable with it to make this happen?

Gusev: Yes, that's what we're in the process of doing. We have some close collaborators at the moment, some internally. Even within the institution, you oftentimes still have to have formal collaboration agreements for sensitive data. We have a close collaborator at Mass General Hospital, which, again, it's a Harvard hospital, but it's its own institution, so formal data-sharing collaborations have to be formed. They've been working with us on this project, and in doing this across two institutions, our hope is that from there we can recruit others, and we've been talking informally with folks at Sloan Kettering and UCSF to show that this can work in a plug-and-play way for two hospitals. I think that'll be the practical way to convince people that this can continue to work at a larger number of institutions.

HCI: When you're sharing oncology data across institutions like that, are there also data model issues in terms of how data is represented in different systems?

Gusev: For images, data modeling is a bit less of an issue because, ultimately, the input is the same. It is a digital representation of a photograph. This problem comes up a lot in the tabular healthcare data space, like electronic health records. There, model structures are really difficult. We run into this a lot for toxicities, because that's not a fully standardized observation. So at some institutions, if somebody has a toxicity in response to a drug, they'll just put “cancer” into the EHR. Other people will put in “autoimmune condition,” and other people will put in exactly the specific thing the person experienced. That human variation, which becomes cultural at different institutions, is really challenging. That is why it is important for model validation to happen across different institutions, and why we're excited about doing this. If you have a model that's predictive in Boston and in San Francisco and in Mexico, the chance that there are biases all lining up in the same way is much lower. So from a scientific perspective, even outside of the logistics, this is really important.

HCI: Is there anything else about the effort I haven't asked about that you want to stress?

Gusev: The ability to move through different levels of security — to either have just a federated approach where nobody has access to anybody else's data, or, on top of that, have a Trusted Execution Environment where even those individual data analyses are done in highly secure environments — that kind of flexibility is something that's pretty unique that I haven't seen from other tools. I think, especially as we try to expand this out to other institutions, they may have additional restrictions that they want to impose on their individual unit, and this software service allows us to do that. So that's also the future-proofing nature of this. If somebody wants something even more secure, we can toggle that on for them.