
FAIRHealth Project: Privacy-Preserving Distributed Learning Infrastructure (PPDL)

Introduction

The FAIRHealth project is a collaboration between Maastricht University and Statistics Netherlands, running from Feb 2018 to Feb 2020 and funded by the Dutch National Research Agenda (NWA) under the VWData program. In this project, we propose an innovative infrastructure for the secure and privacy-preserving analysis of personal health data from multiple providers with different governance policies. The approach uses distributed machine learning to analyze vertically partitioned data, i.e., data in which different variables/attributes/features about a particular individual are distributed over a set of data sources.

The main idea of our infrastructure is to send data-processing or analysis algorithms to the data sources rather than transferring data to the researchers; only the final (verified) results are returned to the researchers. Our infrastructure is an extension of the Personal Health Train architecture: trains (applications) containing analytic algorithms are sent to data stations (sources), and each station can inspect whether the train is allowed to execute its application on (a subset of) the available data.
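
To make this concrete, here is a minimal illustrative sketch of the admission check a station performs before executing a train. This is not the project's actual API; all names in it are hypothetical:

    # Illustrative sketch only -- not the project's actual API; names are hypothetical.
    # A data station admits a train only if its signature checks out and the data it
    # requests falls within the station's local governance policy.

    def station_admits(train, shareable_columns):
        """Inspect a train before it may touch any local data."""
        return train["signature_valid"] and set(train["requested_columns"]) <= set(shareable_columns)

    train = {
        "algorithm": "fit_model",             # analysis code shipped with the train
        "requested_columns": ["age", "bmi"],  # variables the train wants to read
        "signature_valid": True,              # result of a cryptographic check (stubbed)
    }

    # This station's policy allows only these variables to be analyzed locally.
    print(station_admits(train, ["age", "bmi", "pseudonym"]))                         # True
    print(station_admits(dict(train, requested_columns=["income"]), ["age", "bmi"]))  # False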

Please find our publications:

  1. Paper: A Privacy-Preserving Infrastructure for Analyzing Personal Health Data in a Vertically Partitioned Scenario
  2. Paper: Using the Personal Health Train for Automated and Privacy-Preserving Analytics on Vertically Partitioned Data
  3. Others: Slides, Video Demo 1, Video Demo 2, Video Demo 3

Structure of PPDL

As of Feb 2020, the PPDL infrastructure contains six components:

  1. Data transformation (transform CSV and SAV data files to RDF data stored in a graph database; see the sketch after this list)
  2. Overview of data (visualize and obtain basic information/statistical summaries of the data)
  3. Pseudonymization & encryption (pseudonymize personal identifiers (PI) and encrypt data files)
  4. Matching & merging (match and merge multiple datasets on the pseudonymized PI)
  5. Analysis (run the machine learning pipeline)
  6. Logging (record all data-processing history)
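
As an illustration of the first component, the sketch below converts rows of a CSV file into RDF triples with the rdflib package. The namespace and the person/column URIs are hypothetical, and the actual transformation component may work differently:

    import csv

    from rdflib import RDF, Graph, Literal, Namespace

    # Hypothetical namespace; the real component's vocabulary may differ.
    EX = Namespace("http://example.org/fairhealth/")
    g = Graph()

    with open("data_party_1.csv", newline="") as f:
        for i, row in enumerate(csv.DictReader(f)):
            subject = EX["person/%d" % i]
            g.add((subject, RDF.type, EX.Person))
            for column, value in row.items():
                # One triple per cell: <person> <column> "value"
                g.add((subject, EX[column], Literal(value)))

    # Serialize to Turtle; a real deployment would load this into a graph database.
    g.serialize(destination="data_party_1.ttl", format="turtle")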

Prerequisites

Operating system and hardware:

  • Windows 10 (Fall Creators Update or higher)
  • macOS 10.13 (High Sierra)
  • Ubuntu 16.04, 17.10, or 18.04
  • Moderately recent CPU (minimum i5 processor)
  • 8 GB of RAM (not occupied by many other applications/services)

Software:

  • Docker Community Edition
  • Python 3.6 (with pip as dependency manager)

How to use it? (Test on your local laptop)

  1. Install the base container: at all data stations (data parties and the Trusted Secure Environment), a base container needs to be installed. In your terminal:

    docker pull sophia921025/datasharing_base:v0.1
  2. Get an overview of the data: at each data party station, create a folder and put the data file and request.yaml into it. Configure request.yaml based on the overview of the data you need. From the folder that contains the data file and request.yaml, on Mac/Linux run (change "data_party_1.csv" in the third line to the name of your own data file):

    docker run --rm \
    -v "$(pwd)/input/request.yaml:/inputVolume/request.yaml" \
    -v "$(pwd)/input/data_party_1.csv:/data_party_1.csv" \
    -v "$(pwd)/output:/output" sophia921025/datasharing_overview:local0.1

    On Windows, run (change "data_party_1.csv" in the third line to the name of your own data file):

    docker run --rm ^
    -v "%cd%/input/request.yaml:/inputVolume/request.yaml" ^
    -v "%cd%/input/data_party_1.csv:/data_party_1.csv" ^
    -v "%cd%/output:/output" sophia921025/datasharing_overview:local0.1
  3. Generate public-private key pairs for encryption and for verification-key transfer. In your terminal:

    docker run --rm \
    -v "$(pwd)/input/ppkeys_input.yaml:/inputVolume/ppkeys_input.yaml" \
    -v "$(pwd)/output:/output" sophia921025/datasharing_ppkeys:local0.1
  4. Pseudonymization and encryption: pseudonymize the personal identifiers (PI) used for linking multiple datasets, and encrypt the data files (pseudonymized PI + actual data). Go to the folder that contains the data file and encrypt_input.yaml, and configure encrypt_input.yaml first (a conceptual sketch of the key-generation and pseudonymization/encryption steps appears after this list). Then in the terminal on Mac/Linux (change "data_party_1.csv" in the second line to the name of your own data file):

    docker run --rm \
    -v "$(pwd)/input/data_party_1.csv:/data_party_1.csv" \
    -v "$(pwd)/input/publicKey_dms.pem:/publicKey.pem" \
    -v "$(pwd)/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" \
    -v "$(pwd)/output:/output" sophia921025/datasharing_encdata:local0.1

    On Windows (change "data_party_1.csv" in the second line to the name of your own data file):

    docker run --rm ^
    -v "%cd%/input/data_party_1.csv:/data_party_1.csv" ^
    -v "%cd%/input/publicKey_dms.pem:/publicKey.pem" ^
    -v "%cd%/input/encrypt_input.yaml:/inputVolume/encrypt_input.yaml" ^
    -v "%cd%/output:/output" sophia921025/datasharing_encdata:local0.1

After successful execution, your encrypted data file and key file (keys.json) will be stored locally or sent to the server (e.g., a trusted third party or trusted secure environment).
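
To make steps 3 and 4 more concrete, the sketch below shows, in plain Python with the cryptography package, the kind of work these containers perform: generate a public-private key pair, replace a personal identifier with a keyed hash so that parties sharing the same secret derive the same pseudonym for the same person, and encrypt the data file with a fresh symmetric key. File names, the shared secret, and the exact schemes are assumptions for illustration, not the containers' verified implementation:

    import hashlib
    import hmac

    from cryptography.fernet import Fernet
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric import rsa

    # Step 3 (conceptually): generate a public-private key pair.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_pem = private_key.public_key().public_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PublicFormat.SubjectPublicKeyInfo,
    )

    # Step 4a: pseudonymize a personal identifier with a keyed hash, so the same
    # PI yields the same pseudonym at every party that holds the shared secret.
    SHARED_SECRET = b"agreed-out-of-band"  # hypothetical; not how the real key is exchanged
    def pseudonymize(pi):
        return hmac.new(SHARED_SECRET, pi.encode(), hashlib.sha256).hexdigest()

    # Step 4b: encrypt the data file with a fresh symmetric key (conceptually the
    # kind of material that ends up in keys.json).
    sym_key = Fernet.generate_key()
    with open("data_party_1.csv", "rb") as f:
        ciphertext = Fernet(sym_key).encrypt(f.read())

    print(pseudonymize("123456789"))  # stable pseudonym usable for record linkage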

  5. Sign your model file (Python script) at all data parties: create a folder containing your_model.py and sign_model_input.yaml (which needs to be configured). Then in the terminal, on Mac/Linux (change "your_model.py" in the second line to the name of your own model file):

    docker run --rm \
    -v "$(pwd)/input/your_model.py:/your_model.py" \
    -v "$(pwd)/input/sign_model_input.yaml:/inputVolume/sign_model_input.yaml" \
    -v "$(pwd)/output:/output" sophia921025/datasharing_signmodel:local0.1

    On Windows (change "your_model.py" in the second line to the name of your own model file):

    docker run --rm ^
    -v "%cd%/input/your_model.py:/your_model.py" ^
    -v "%cd%/input/sign_model_input.yaml:/inputVolume/sign_model_input.yaml" ^
    -v "%cd%/output:/output" sophia921025/datasharing_signmodel:local0.1
  6. At the Trusted Secure Environment (TSE), create a folder and put the encrypted data files from the data parties, security_input.yaml, analysis_input.yaml, and your analysis Python script (ML models) into it. Configure security_input.yaml based on the keys from the data parties, and analysis_input.yaml based on your analysis requirements (a conceptual sketch of the verification and merging done at the TSE appears after this list). In your terminal:

    Mac/Linux:

    docker run --rm \
    -v "$(pwd)/input:/input" \
    -v "$(pwd)/output:/output" \
    -v "$(pwd)/input/security_input.yaml:/inputVolume/security_input.yaml" \
    -v "$(pwd)/input/analysis_input.yaml:/inputVolume/analysis_input.yaml" \
    sophia921025/datasharing_tse:v0.1

    Windows:

    docker run --rm ^
    -v "%cd%/input:/input" ^
    -v "%cd%/output:/output" ^
    -v "%cd%/input/security_input.yaml:/inputVolume/security_input.yaml" ^
    -v "%cd%/input/analysis_input.yaml:/inputVolume/analysis_input.yaml" ^
    sophia921025/datasharing_tse:v0.1
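
Conceptually, the TSE first verifies the signed model, then decrypts the data files, merges them on the pseudonymized PI, and only then runs the analysis. The sketch below shows the verify-and-merge core using the cryptography and pandas packages; the signature file name, the RSA/PKCS1v15 scheme, and the column names are assumptions for illustration, not the project's exact implementation:

    import pandas as pd
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    # Verify the data party's signature over the model file before executing it.
    with open("publicKey.pem", "rb") as f:
        public_key = serialization.load_pem_public_key(f.read())
    with open("your_model.py", "rb") as f:
        model_bytes = f.read()
    with open("your_model.sig", "rb") as f:  # hypothetical detached-signature file
        signature = f.read()

    try:
        public_key.verify(signature, model_bytes, padding.PKCS1v15(), hashes.SHA256())
    except InvalidSignature:
        raise SystemExit("Signed model failed verification; refusing to run it.")

    # After decryption (omitted here), merge the vertically partitioned tables on
    # the pseudonymized PI so each record joins the variables from both parties.
    party_1 = pd.read_csv("decrypted_party_1.csv")  # e.g., pseudonym, age, bmi
    party_2 = pd.read_csv("decrypted_party_2.csv")  # e.g., pseudonym, income
    merged = party_1.merge(party_2, on="pseudonym", how="inner")
    print(merged.shape)  # the analysis script would now run on `merged`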

If the Docker container runs properly, you will see execution logs like the ones below. At the end, all results and logging histories (ppds.log) are stored in the output folder. To avoid data leakage during troubleshooting, any error messages raised during execution are saved in ppds.log instead of being printed to the screen.

INFO     ░ 2020-02-02 19:40:56,751 ░ verDec ░ verDec.py line 14 ▓ Reading request.yaml file...
INFO     ░ 2020-02-02 19:40:56,944 ░ verDec ░ verDec.py line 111 ▓ Signed models has been verified successfully!
INFO     ░ 2020-02-02 19:40:56,945 ░ verDec ░ verDec.py line 151 ▓ Verification and decryption took 0.3028s to run
...
INFO     ░ 2020-01-19 10:25:05,619 ░ main ░ main.py line 272 ▓ In total, all models training took 16.6441s to run.
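
The format above comes from Python's logging module. A minimal sketch of the behaviour described here, with every message (including errors) routed to ppds.log rather than the console, might look as follows; the format string only approximates the lines above and is not the project's verified configuration:

    import logging

    # Send all log records to ppds.log instead of the screen, roughly matching
    # the format of the log lines shown above.
    logging.basicConfig(
        filename="ppds.log",
        level=logging.INFO,
        format="%(levelname)-8s ░ %(asctime)s ░ %(module)s ░ %(filename)s line %(lineno)d ▓ %(message)s",
    )
    logger = logging.getLogger("verDec")

    try:
        logger.info("Reading request.yaml file...")
        raise ValueError("simulated failure")
    except ValueError:
        # The traceback goes to the file, not the screen, to avoid leaking data.
        logger.exception("Execution failed; details recorded in ppds.log only.")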
