TY - JOUR
T1 - RaFAH
T2 - Host prediction for viruses of Bacteria and Archaea based on protein content
AU - Coutinho, Felipe Hernandes
AU - Zaragoza-Solas, Asier
AU - López-Pérez, Mario
AU - Barylski, Jakub
AU - Zielezinski, Andrzej
AU - Dutilh, Bas E
AU - Edwards, Robert
AU - Rodriguez-Valera, Francisco
N1 - Funding Information:
This work was supported by grants “VIREVO” CGL2016-76273-P [MCI/AEI/FEDER, EU] (co-funded with FEDER funds) from the Spanish Ministerio de Ciencia e Innovación and “HIDRAS3” PROMETEU/2019/009 from Generalitat Valenciana. F.R.-V. was also a beneficiary of the 5top100-program of the Ministry for Science and Education of Russia. F.H.C. was supported by APOSTD/2018/186 post-doctoral fellowships from Generalitat Valenciana. A.Z. was funded by the Polish National Science Centre (2018/31/D/NZ2/00108). J.B.’s research was supported by the National Center for Research and Development (NCBR, Poland), grant number LIDER/5/0023/L-10/18/NCBR/2019. B.E.D. was supported by the Netherlands Organization for Scientific Research (NWO) Vidi grant 864.14.004 and by the European Research Council Consolidator grant 865694: DiversiPHI. R.E. was supported by National Institutes of Health grant RC2 DK116713-01A1. F.H.C. conceived and designed the experiments. F.H.C., A.Z.-S., M.L.-P., A.Z., J.B., B.E.D., and R.E. analyzed the data. All authors contributed to writing the manuscript. The authors declare no competing interests.
Publisher Copyright:
© 2021 The Authors
PY - 2021/7/9
Y1 - 2021/7/9
N2 - Culture-independent approaches have recently shed light on the genomic diversity of viruses of prokaryotes. One fundamental question when trying to understand their ecological roles is: which host do they infect? To tackle this issue we developed a machine-learning approach named Random Forest Assignment of Hosts (RaFAH), that uses scores to 43,644 protein clusters to assign hosts to complete or fragmented genomes of viruses of Archaea and Bacteria. RaFAH displayed performance comparable with that of other methods for virus-host prediction in three different benchmarks encompassing viruses from RefSeq, single amplified genomes, and metagenomes. RaFAH was applied to assembled metagenomic datasets of uncultured viruses from eight different biomes of medical, biotechnological, and environmental relevance. Our analyses led to the identification of 537 sequences of archaeal viruses representing unknown lineages, whose genomes encode novel auxiliary metabolic genes, shedding light on how these viruses interfere with the host molecular machinery. RaFAH is available at https://sourceforge.net/projects/rafah/.
AB - Culture-independent approaches have recently shed light on the genomic diversity of viruses of prokaryotes. One fundamental question when trying to understand their ecological roles is: which host do they infect? To tackle this issue we developed a machine-learning approach named Random Forest Assignment of Hosts (RaFAH), that uses scores to 43,644 protein clusters to assign hosts to complete or fragmented genomes of viruses of Archaea and Bacteria. RaFAH displayed performance comparable with that of other methods for virus-host prediction in three different benchmarks encompassing viruses from RefSeq, single amplified genomes, and metagenomes. RaFAH was applied to assembled metagenomic datasets of uncultured viruses from eight different biomes of medical, biotechnological, and environmental relevance. Our analyses led to the identification of 537 sequences of archaeal viruses representing unknown lineages, whose genomes encode novel auxiliary metabolic genes, shedding light on how these viruses interfere with the host molecular machinery. RaFAH is available at https://sourceforge.net/projects/rafah/.
KW - DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
KW - host prediction
KW - machine learning
KW - random forest
KW - viral diversity
KW - viral ecology
KW - virome
KW - virus
KW - virus-host associations
UR - http://www.scopus.com/inward/record.url?scp=85109448490&partnerID=8YFLogxK
U2 - 10.1016/j.patter.2021.100274
DO - 10.1016/j.patter.2021.100274
M3 - Article
C2 - 34286299
SN - 2666-3899
VL - 2
SP - 1
EP - 9
JO - Patterns (New York, N.Y.)
JF - Patterns (New York, N.Y.)
IS - 7
M1 - 100274
ER -