Machine learning enables scalable and systematic hierarchical virus taxonomy

  • Benjamin Bolduc*
  • Olivier Zablocki
  • Dann Turner
  • Ho Bin Jang
  • Jiarong Guo
  • Evelien M Adriaenssens
  • Bas E Dutilh
  • Matthew B Sullivan*

  *Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Although virus ecogenomics has expanded access to and understanding of the virosphere, existing classification tools lack taxonomic resolution and are unable to scale to modern discovery-based datasets or classify previously unknown sequence space. Here we develop vConTACT3, a machine learning-based tool that improves scalability and accuracy of virus taxonomy. By optimizing gene-sharing thresholds and leveraging adaptive, realm-specific cut-offs, vConTACT3 expands classification to both eukaryote and prokaryote viruses for four of the six officially recognized realms, and establishes accurate hierarchical taxonomy from genus to order. Specifically, vConTACT3 achieves >95% agreement with official taxonomy for 35,545 and 13,524 public prokaryotic and eukaryotic virus genomes, respectively, to surpass vConTACT2 across most realms, while still uniquely classifying previously uncharacterized taxa, and doing so even faster. vConTACT3 application provides taxonomy assignments for tens of thousands of unclassified taxa rapidly, automatically and systematically; evaluates virus sequence space to reveal support for fewer taxonomic ranks than currently available; and identifies taxonomically challenging areas across the virosphere.

Original language: English
Journal: Nature Biotechnology
Publication status: E-pub ahead of print, 19 Dec 2025

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2025.

Funding

This work was supported by the National Science Foundation under Grants No. DBI-2149505 (iVirus2) and DBI-2022070 (BII-Implementation: the EMERGE Institute). This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, under Award Number DE-SC0023307. High-performance computing was provided by the Ohio Supercomputer Center. Additional support was provided by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy (EXC 2051) Project-ID 390713860, the European Research Council (ERC) Consolidator grant 865694: DiversiPHI, the Alexander von Humboldt Foundation in the context of an Alexander von Humboldt-Professorship funded by the German Federal Ministry of Education and Research, and the European Union's Horizon 2020 research and innovation program, under the Marie Skłodowska-Curie Actions Innovative Training Networks grant agreement no. 955974 (VIROINF). EMA gratefully acknowledges the support of the Biotechnology and Biological Sciences Research Council (BBSRC); this research was funded by the BBSRC Institute Strategic Programme Food Microbiome and Health BB/X011054/1 and its constituent projects BBS/E/QU/230001B and BBS/E/QU/230001D, as well as the BBSRC Institute Strategic Programme Microbes and Food Safety BB/X011011/1 and its constituent projects BBS/E/QU/230002A, BBS/E/QU/230002B and BBS/E/QU/230002C.

Funders and funder numbers
National Science Foundation (NSF): DBI-2149505, DBI-2022070
