TY - JOUR
T1 - Systematically linking tranSMART, Galaxy and EGA for reusing human translational research data
AU - Zhang, Chao
AU - Bijlard, Jochem
AU - Staiger, Christine
AU - Scollen, Serena
AU - van Enckevort, David
AU - Hoogstrate, Youri
AU - Senf, Alexander
AU - Hiltemann, Saskia
AU - Repo, Susanna
AU - Pipping, Wibo
AU - Bierkens, Mariska
AU - Payralbe, Stefan
AU - Stringer, Bas
AU - Heringa, Jaap
AU - Stubbs, Andrew
AU - Bonino Da Silva Santos, Luiz Olavo
AU - Belien, Jeroen
AU - Weistra, Ward
AU - Azevedo, Rita
AU - van Bochove, Kees
AU - Meijer, Gerrit
AU - Boiten, Jan Willem
AU - Rambla, Jordi
AU - Fijneman, Remond
AU - Spalding, J. Dylan
AU - Abeln, Sanne
N1 - Publisher Copyright: © 2017 Zhang C et al.
PY - 2017
Y1 - 2017
AB - The availability of high-throughput molecular profiling techniques has provided more accurate and informative data for regular clinical studies. Nevertheless, complex computational workflows are required to interpret these data. Over the past years, the data volume has been growing explosively, requiring robust human data management to organise and integrate the data efficiently. For this reason, we set up an ELIXIR implementation study, together with the Translational research IT (TraIT) programme, to design a data ecosystem that is able to link raw and interpreted data. In this project, the data from the TraIT Cell Line Use Case (TraIT-CLUC) are used as a test case for this system. Within this ecosystem, we use the European Genome-phenome Archive (EGA) to store raw molecular profiling data; tranSMART to collect interpreted molecular profiling data and clinical data for corresponding samples; and Galaxy to store, run and manage the computational workflows. We can integrate these data by linking their repositories systematically. To showcase our design, we have structured the TraIT-CLUC data, which contain a variety of molecular profiling data types, for storage in both tranSMART and EGA. The metadata provided allows referencing between tranSMART and EGA, fulfilling the cycle of data submission and discovery; we have also designed a data flow from EGA to Galaxy, enabling reanalysis of the raw data in Galaxy. In this way, users can select patient cohorts in tranSMART, trace them back to the raw data and perform (re)analysis in Galaxy. Our conclusion is that the majority of metadata does not necessarily need to be stored (redundantly) in both databases, but that instead FAIR persistent identifiers should be available for well-defined data ontology levels: study, data access committee, physical sample, data sample and raw data file. This approach will pave the way for the stable linkage and reuse of data.
KW - Data management
KW - EGA
KW - FAIR
KW - Galaxy
KW - Reproducibility
KW - Translational research
KW - tranSMART
KW - Workflows
UR - http://www.scopus.com/inward/record.url?scp=85032944603&partnerID=8YFLogxK
U2 - 10.12688/f1000research.12168.1
DO - 10.12688/f1000research.12168.1
M3 - Article
AN - SCOPUS:85032944603
SN - 2046-1402
VL - 6
JO - F1000Research
JF - F1000Research
M1 - 1488
ER -