In an earlier post, we highlighted the need to develop appropriate measures to safeguard against privacy risks stemming from biomedical research. Among the possible mechanisms, cryptographic and differential privacy techniques were suggested as methods that offer tradeoffs between the privacy and the utility of the data.
In this post, we address the challenges of data protection from a data management perspective. We tell the story of scientists who work for different institutions, who are governed by rigorous data privacy policies, and who wish to share the results of their research. Along the way, we examine the essential qualities expected of a research platform designed to allow such a collaboration.
We use biomedical research as the backdrop for the story, noting that the same principles apply to other domains where data protection is a major concern.
Hospitals are a picture-perfect example of this. They are custodians of patient data, which they need for therapy decision-making. The data is highly confidential and must be protected. Yet there is value in sharing this data between medical institutions. For instance, Alice, a scientist in a biomedical research lab, analyses clinical data to discover correlations between biomarker profiles and successful therapies. She uses this information to compose a panoply of treatments customized for different biomarker profiles. Bob, a physician in a hospital, will benefit from the experience gained in other hospitals, or from Alice’s results, when deciding on the best treatment for his patients. To this end, both institutions must be able to securely use each other’s data without losing control of their data sets, compromising the privacy of their patients, or divulging the intellectual property of their analytics. Because hospitals must obey strict data governance rules, because they are mutually non-trusting, and because the data must physically stay within the borders of the hospitals’ respective countries, they do not allow their data to leave their servers.
As illustrated in the figure above, one way to achieve this level of control is to let the participating institutions independently administer their own research platforms (platform instances) and connect them in a federated structure governed by a federated identity management arrangement. Under this model, participating institutions interact in a flat hierarchy that does not require mutual trust between administrative domains. Each institution is in full control of its platform services and resources, and manages them according to its own preferences. In this context, resources consist of data sets, algorithms, compute infrastructure (CPU, storage), and meta-data that describe the resources and express relationships between them. Note that a platform instance should not make any assumption about the implementation details of other platform instances, nor about the safety of their execution context. In short, it should never entrust them with sensitive information.
From a user perspective, each platform should let users seamlessly search the collective meta-data for resources, and access those resources anywhere in the federation as long as the permission rules specified by the owning institution allow it.
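To make the idea of a permission-filtered global view more concrete, here is a minimal Python sketch. It is not the platform's actual API: the names ResourceMeta, PlatformInstance and federated_search, the "*" marker for public resources, and the sample catalog entries are all assumptions made for illustration. Each instance filters its own catalog before anything is merged, so the permission decision stays with the owning institution.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ResourceMeta:
    """Meta-data describing one resource (hypothetical structure)."""
    name: str
    kind: str            # e.g. "dataset" or "algorithm"
    owner: str           # institution that owns the resource
    allowed: frozenset   # user ids allowed to see it; "*" means public


@dataclass
class PlatformInstance:
    """One independently administered platform instance (hypothetical)."""
    institution: str
    catalog: list = field(default_factory=list)

    def visible_to(self, user):
        # The owning instance applies its own permission rules locally,
        # so no other instance is trusted with that decision.
        return [m for m in self.catalog
                if "*" in m.allowed or user in m.allowed]


def federated_search(instances, user, keyword):
    """Merge the per-instance views into one global view for `user`."""
    hits = []
    for inst in instances:
        for meta in inst.visible_to(user):
            if keyword.lower() in meta.name.lower():
                hits.append(meta)
    return hits


# Made-up catalogs for Alice's lab and Bob's hospital.
lab = PlatformInstance("lab", [
    ResourceMeta("biomarker-correlation", "algorithm", "lab",
                 frozenset({"*"})),
    ResourceMeta("biomarker-draft-notes", "dataset", "lab",
                 frozenset({"alice"})),
])
hospital = PlatformInstance("hospital", [
    ResourceMeta("biomarker-clinical-records", "dataset", "hospital",
                 frozenset({"bob"})),
])

bob_view = federated_search([lab, hospital], "bob", "biomarker")
alice_view = federated_search([lab, hospital], "alice", "biomarker")
print([m.name for m in bob_view])
print([m.name for m in alice_view])
```

In this toy setup Bob sees Alice's public algorithm and his own hospital's records, but not her private notes, while Alice sees both of her lab's resources and nothing from the hospital.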
Going back to our earlier example, after Bob authenticates to the platform instance of his institution, he is provided with a global view of all the meta-data visible to him from all the platform instances combined, including data and algorithms made publicly available by Alice. He can explore this view and discover data sets of interest or research projects related to his own. Alice is provided with a similar view (with visibility limited by her own permissions) after authenticating to the platform of her institution. Alice can use her view of the research world as a one-stop shop to execute an algorithm on selected input data sets. Under the hood, the platform instances cooperate to orchestrate the execution in accordance with their respective rules. For instance, due to the private nature of the data, Bob’s institution could impose additional restrictions on how external entities such as Alice can access its data: the data may be accessible only to authorized applications running on behalf of authorized persons or labs (Alice), and only on servers administered by Bob’s institution. Under this scenario, Alice does not see the input data. She only sees the results of the execution of her algorithms, stripped of all personally identifiable information.
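The restrictions described in this scenario can be sketched as a policy check performed by the data owner before any execution. The Python below is an illustrative sketch only: AccessPolicy, ExecutionRequest, run_on_owner_side, the PII_FIELDS set and the sample records are hypothetical, and dropping a fixed set of fields is a deliberately simplistic stand-in for real de-identification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPolicy:
    """Rules set by the data owner (hypothetical structure)."""
    allowed_users: frozenset   # persons or labs allowed to request runs
    allowed_apps: frozenset    # applications allowed to touch the data
    allowed_hosts: frozenset   # servers administered by the data owner


@dataclass(frozen=True)
class ExecutionRequest:
    user: str
    app: str
    host: str


def is_authorized(policy, request):
    # All three conditions must hold: who asks, which application runs,
    # and where it runs.
    return (request.user in policy.allowed_users
            and request.app in policy.allowed_apps
            and request.host in policy.allowed_hosts)


# Assumption: PII is identified by field name. Real systems need more.
PII_FIELDS = {"name", "birth_date"}


def run_on_owner_side(policy, request, records, algorithm):
    """Run `algorithm` where the data lives and return PII-free results."""
    if not is_authorized(policy, request):
        raise PermissionError("request rejected by the data owner's policy")
    result = algorithm(records)
    return [{k: v for k, v in row.items() if k not in PII_FIELDS}
            for row in result]


def responders(records):
    """Toy stand-in for Alice's algorithm: keep patients who responded."""
    return [r for r in records if r["responded"]]


# Made-up records held by Bob's institution.
records = [
    {"name": "P. One", "birth_date": "1970-01-01",
     "biomarker": "B1", "responded": True},
    {"name": "P. Two", "birth_date": "1980-02-02",
     "biomarker": "B2", "responded": False},
]
policy = AccessPolicy(frozenset({"alice"}),
                      frozenset({"biomarker-correlation"}),
                      frozenset({"hospital-server-1"}))

request = ExecutionRequest("alice", "biomarker-correlation",
                           "hospital-server-1")
output = run_on_owner_side(policy, request, records, responders)
print(output)
```

The same request aimed at a server outside Bob's institution fails the host check and raises PermissionError, so the input data never leaves the owner's servers.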
Using this model as a basis for secure collaborative research, we can build business logic layers that extend it with additional useful capabilities. Because every resource access must be authorized by the platform instance that owns the resource, the platform can automatically preserve a trace of user activities. From this trace a lineage can be derived, such as the algorithm executions and input data that were used to create an output data set, expressed in a form similar to the one illustrated in the figure.
Patients and hospitals can use this lineage for attribution purposes, to find out who is using their data, how, and for what purposes.
Conversely, doctors and researchers can verify the origin of the data and trace it back to its sources through all the intermediate transformation steps (as long as the permissions allow it).
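Both directions of lineage described above, from a result back to its sources and from a source forward to its users, can be sketched with a small activity log. This is an illustrative Python sketch, not the platform's data model: LineageLog, its method names and the sample activities are assumptions, the lineage is assumed to be acyclic, and the permission checks mentioned above are left out for brevity.

```python
class LineageLog:
    """Records which execution produced which data set (hypothetical)."""

    def __init__(self):
        # output data set -> list of (algorithm, user, input data sets)
        self._derived_from = {}

    def record(self, user, algorithm, inputs, output):
        """Log one authorized execution."""
        step = (algorithm, user, tuple(inputs))
        self._derived_from.setdefault(output, []).append(step)

    def sources(self, dataset):
        """Original data sets that `dataset` ultimately derives from."""
        if dataset not in self._derived_from:
            return {dataset}   # nothing produced it: it is a source
        found = set()
        for _algorithm, _user, inputs in self._derived_from[dataset]:
            for parent in inputs:
                found |= self.sources(parent)
        return found

    def users_of(self, dataset):
        """Users whose executions consumed `dataset`, even indirectly."""
        users = set()
        for output, steps in self._derived_from.items():
            for _algorithm, user, inputs in steps:
                if dataset in inputs:
                    users.add(user)
                    users |= self.users_of(output)
        return users


# Made-up activity trace.
log = LineageLog()
log.record("alice", "clean", ["hospital-records"], "cleaned-records")
log.record("alice", "biomarker-correlation",
           ["cleaned-records", "lab-reference"], "correlation-table")
log.record("bob", "treatment-report", ["correlation-table"], "report")

print(sorted(log.sources("report")))
print(sorted(log.users_of("hospital-records")))
```

Here a doctor reading the report can trace it back to the hospital records and the lab reference data, and the hospital can see that both Alice and Bob have built on its records.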
At the Swiss Data Science Center, we are developing an open source middleware to enable such collaboration according to the principles described above. We call our platform RENKU (連句). As in the traditional art form of renku, the platform encourages interdisciplinary cooperation (or coopetition) between data scientists like Alice and Bob to advance a research agenda through individual leap contributions. The platform is available free of charge at get-renga.io. Stay tuned.
Eric Bouillet, Head of Engineering, Swiss Data Science Center