The geNomad pipeline

When you execute the genomad end-to-end command, geNomad runs a series of modules sequentially to produce the final output, which contains the identified plasmids and viruses in the input FASTA file.

_images/pipeline_overview.svg

You can also run the modules one at a time, which gives you access to some advanced parameters that are not available through the genomad end-to-end command:

genomad download-database .
genomad annotate metagenome.fna genomad_output genomad_db
genomad find-proviruses metagenome.fna genomad_output genomad_db
genomad marker-classification metagenome.fna genomad_output genomad_db
genomad nn-classification metagenome.fna genomad_output
genomad aggregated-classification metagenome.fna genomad_output
genomad score-calibration metagenome.fna genomad_output
genomad summary metagenome.fna genomad_output
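
The command sequence above can also be assembled programmatically, which may be convenient for batch runs. The following is a minimal Python sketch: the function names `build_pipeline_commands` and `run_pipeline` are hypothetical, and the sketch only reproduces the subcommands and positional arguments listed above, without any optional parameters.

```python
import subprocess

# Modules that take the database directory as a third positional argument,
# as in the command listing above.
MODULES_WITH_DB = ("annotate", "find-proviruses", "marker-classification")
# Modules that only take the input FASTA and the output directory.
MODULES_WITHOUT_DB = (
    "nn-classification",
    "aggregated-classification",
    "score-calibration",
    "summary",
)


def build_pipeline_commands(fasta, output_dir, database_dir):
    """Return the per-module genomad commands in execution order."""
    commands = []
    for module in MODULES_WITH_DB:
        commands.append(["genomad", module, fasta, output_dir, database_dir])
    for module in MODULES_WITHOUT_DB:
        commands.append(["genomad", module, fasta, output_dir])
    return commands


def run_pipeline(fasta, output_dir, database_dir):
    """Run each module in turn, stopping at the first failure."""
    for command in build_pipeline_commands(fasta, output_dir, database_dir):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("metagenome.fna", "genomad_output", "genomad_db")` would execute the seven module commands in the order shown above, assuming the database has already been downloaded.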

In most cases, the genomad end-to-end command is sufficient. It is still useful to understand what happens when the full pipeline runs. The sections below explain each module’s function and how geNomad processes your input sequences to identify plasmids and viruses.

annotate

_images/annotate.svg

The annotate module has two main functions: predicting genes in the input sequences with pyrodigal-gv, and using MMseqs2 to assign the predicted genes to marker protein families from a dataset of 227,897 profiles specific to chromosomes, plasmids, or viruses. This marker dataset provides comprehensive metadata that can aid in the downstream interpretation of the results. It includes:

  • Functional annotations via Pfam, COG, TIGRFAM, and KEGG Orthology accessions.

  • Hallmark genes, which are involved in key plasmid or virus functions.

  • Conjugation genes, through CONJscan accessions.

  • Antimicrobial resistance genes, via AMRFinder accessions.

  • Universal single-copy genes (USCGs) that are typically present in chromosomes and rare in plasmids and viruses, identified using BUSCO.

  • Virus taxonomy, through the use of ICTV’s VMR number 19 lineages.

The annotate module generates two primary outputs: taxonomic assignments of the input sequences (you can find an explanation of how geNomad assigns sequences to viral taxa here), and gene-level annotations (as shown in the Quickstart example). These outputs are utilized by the find-proviruses, marker-classification, and summary modules.

find-proviruses

_images/find_proviruses.svg

The find-proviruses module is designed to identify proviral regions within host sequences. To achieve this, it uses a conditional random field (CRF) model that takes gene annotations generated by the annotate module and demarcates regions that are enriched in viral-specific markers, surrounded by host-specific markers. To refine the boundaries of proviruses, geNomad leverages the fact that phages often integrate next to tRNAs and that integrases are typically found at the edges of integrated phages. This is achieved by extending the edges until neighboring tRNAs (identified with ARAGORN) and/or integrases (identified with MMseqs2) are reached. For a detailed explanation of geNomad’s provirus identification algorithm, please refer to our provirus identification documentation.
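
The boundary refinement step can be pictured with a small Python sketch. It is an illustration only and not geNomad’s implementation: the function `extend_boundaries`, the coordinate tuples, and the `max_distance` cutoff are all hypothetical, and the sketch assumes that an edge is extended only when a tRNA or integrase lies within `max_distance` bases of it.

```python
def extend_boundaries(region, anchors, max_distance):
    """Extend a provirus region to nearby tRNAs or integrases.

    region: (start, end) coordinates of the initial provirus call.
    anchors: list of (start, end) coordinates of tRNAs and/or integrases.
    max_distance: hypothetical cutoff (in bases) for how far an anchor
        may be from an edge and still be used to extend it.
    """
    start, end = region
    new_start, new_end = start, end
    for anchor_start, anchor_end in anchors:
        # Anchor lies upstream of the region and is close enough.
        if anchor_end <= start and start - anchor_end <= max_distance:
            new_start = min(new_start, anchor_start)
        # Anchor lies downstream of the region and is close enough.
        elif anchor_start >= end and anchor_start - end <= max_distance:
            new_end = max(new_end, anchor_end)
    return new_start, new_end
```

For instance, with a region of `(10000, 30000)`, a tRNA at `(9500, 9575)`, and an integrase at `(31000, 32200)`, a cutoff of 5000 bases would extend the region to cover both, while an anchor tens of kilobases away would be ignored.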

marker-classification


_images/marker_classification.svg

The marker-classification module in geNomad is designed to classify sequences as either chromosomes, plasmids, or viruses based on their marker content. To achieve this, the module takes gene annotations and calculates a set of numerical features that describe the gene structure and marker content of the sequences that need to be classified. These features include gene density, as well as the frequency of chromosome, plasmid, and virus markers.
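
To make the kind of features described above more concrete, here is a minimal Python sketch. The definitions are simplified assumptions chosen for illustration (for example, genes are assumed not to overlap), and the function names and input representations are hypothetical; geNomad’s exact calculations are described in its marker features documentation.

```python
def strand_switch_rate(strands):
    """Fraction of adjacent gene pairs encoded on opposite strands."""
    if len(strands) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(strands, strands[1:]) if a != b)
    return switches / (len(strands) - 1)


def coding_density(genes, sequence_length):
    """Fraction of the sequence covered by genes.

    genes: list of (start, end) coordinates, 1-based and inclusive.
    ASSUMPTION: genes do not overlap.
    """
    coding = sum(end - start + 1 for start, end in genes)
    return coding / sequence_length


def marker_frequencies(marker_classes):
    """Fraction of genes assigned to chromosome, plasmid, or virus markers.

    marker_classes: one entry per gene, "c", "p", "v", or None when the
    gene was not assigned to any marker.
    """
    n = len(marker_classes)
    return {
        cls: (sum(1 for m in marker_classes if m == cls) / n if n else 0.0)
        for cls in "cpv"
    }
```

Under these definitions, a hypothetical 1,000 bp sequence with two genes at `(1, 300)` and `(350, 900)` on opposite strands, one of which matches a virus marker, would have a strand switch rate of 1.0, a coding density of 0.851, and a virus marker frequency of 0.5.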

Below is an example of the features that are computed for five input sequences. You can learn more about how each feature is calculated by visiting our marker features documentation.

seq_name                   sequence_1  sequence_2  sequence_3  sequence_4  sequence_5
strand_switch_rate         0.0000      0.0000      1.0000      0.0000      0.0000
coding_density             0.9049      0.7845      0.8704      0.8087      0.9861
no_rbs_freq                1.0000      0.5000      1.0000      0.0000      0.5000
sd_bacteroidetes_rbs_freq  0.0000      0.0000      0.0000      0.0000      0.0000
sd_canonical_rbs_freq      0.0000      0.0000      0.0000      1.0000      0.5000
tatata_rbs_freq            0.0000      0.0000      0.0000      0.0000      0.0000
cc_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
cp_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
cv_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
pc_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
pp_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
pv_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
vc_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000
vp_marker_freq             0.0000      0.5000      0.5000      1.0000      0.0000
vv_marker_freq             0.0000      0.0000      0.0000      0.0000      1.0000
c_marker_freq              0.0000      0.0000      0.0000      0.0000      0.0000
p_marker_freq              0.0000      0.0000      0.0000      0.0000      0.0000
v_marker_freq              0.0000      0.5000      0.5000      1.0000      1.0000
median_c_spm               0.0000      0.0278      0.0086      0.0027      0.0043
median_p_spm               0.0000      0.4961      0.2801      0.2087      0.0000
median_v_spm               0.0000      0.8678      0.9599      0.9780      1.0000
v_vs_c_score_logistic      0.5000      0.6630      0.6903      0.8398      0.8474
v_vs_p_score_logistic      0.5000      0.5914      0.6557      0.8064      0.8479
p_vs_c_score_logistic      0.5000      0.5762      0.5392      0.5571      0.4989
gv_marker_freq             0.0000      0.0000      0.0000      0.0000      0.0000

marker-classification then feeds these features to a tree ensemble classification algorithm, trained with XGBoost, which produces three scores for each sequence. These scores represent the model’s confidence that the sequence represents a chromosome, plasmid, or virus.

seq_name    chromosome_score  plasmid_score  virus_score
sequence_1  0.5420            0.1397         0.3183
sequence_2  0.2172            0.2148         0.5680
sequence_3  0.2937            0.1957         0.5106
sequence_4  0.0524            0.0718         0.8758
sequence_5  0.1621            0.0168         0.8211

In the example shown above, the model classified the first sequence as a chromosome and the remaining sequences as viruses. In terms of confidence, the model is more certain that sequence_4 and sequence_5 are viruses (virus scores above 0.8) than it is about sequence_2 and sequence_3 (virus scores between 0.5 and 0.6).
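
Reading the scores in the example above amounts to picking the highest of the three scores for each sequence. A minimal Python sketch, with the values copied from the marker-classification example and a hypothetical helper named `classify`; it only illustrates how to interpret one classifier’s scores, not how geNomad makes its final calls:

```python
# Scores copied from the marker-classification example above.
SCORES = {
    "sequence_1": {"chromosome": 0.5420, "plasmid": 0.1397, "virus": 0.3183},
    "sequence_2": {"chromosome": 0.2172, "plasmid": 0.2148, "virus": 0.5680},
    "sequence_3": {"chromosome": 0.2937, "plasmid": 0.1957, "virus": 0.5106},
    "sequence_4": {"chromosome": 0.0524, "plasmid": 0.0718, "virus": 0.8758},
    "sequence_5": {"chromosome": 0.1621, "plasmid": 0.0168, "virus": 0.8211},
}


def classify(scores):
    """Return the class with the highest score, together with that score."""
    label = max(scores, key=scores.get)
    return label, scores[label]


predictions = {name: classify(scores) for name, scores in SCORES.items()}
for name, (label, score) in predictions.items():
    print(f"{name}\t{label}\t{score:.4f}")
```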

nn-classification


_images/nn_classification.svg

The nn-classification module also classifies input sequences into chromosomes, plasmids, or viruses, similar to the marker-classification module. However, unlike the latter, it doesn’t rely on marker information. Instead, it directly processes nucleotide sequences using a neural network. The nucleotide sequences are first encoded into a numerical matrix, which is then fed into an IGLOO neural network. The network is capable of detecting sequence features that distinguish chromosomes, plasmids, and viruses. Finally, the module produces confidence scores for the classifications.

seq_name    chromosome_score  plasmid_score  virus_score
sequence_1  0.3307            0.5597         0.1096
sequence_2  0.0669            0.1411         0.7920
sequence_3  0.6720            0.1340         0.1940
sequence_4  0.2923            0.2830         0.4247
sequence_5  0.0591            0.1545         0.7864

If you’re interested in learning more about how the neural network processes and classifies nucleotide sequences, check out the detailed explanation.
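
As an illustration of what encoding a nucleotide sequence into a numerical matrix can look like, the sketch below uses one-hot encoding of single nucleotides in Python. This particular scheme is an assumption chosen for simplicity: this page does not specify the encoding geNomad uses, and the function name `one_hot_encode` is hypothetical.

```python
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Encode a nucleotide sequence as a list of 4-element rows.

    Each row has a 1 in the column of the matching base (A, C, G, T).
    Bases outside the alphabet, such as N, become all-zero rows.
    """
    matrix = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)
        if index != -1:
            row[index] = 1
        matrix.append(row)
    return matrix
```

A matrix of this kind, with one row per position, is the sort of numerical input that a sequence-based neural network can process directly.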

aggregated-classification


_images/aggregated_classification.svg

The aggregated-classification module combines the outputs of marker-classification and nn-classification to produce a set of scores that takes advantage of the strengths of both classifiers.

seq_name    chromosome_score  plasmid_score  virus_score
sequence_1  0.2169            0.5661         0.2170
sequence_2  0.0513            0.1541         0.7946
sequence_3  0.4592            0.2033         0.3375
sequence_4  0.0402            0.0446         0.9153
sequence_5  0.0233            0.0276         0.9491

To combine the two sets of scores, the module employs an attention mechanism that weights the contribution of each classifier, so that the contribution of marker-classification increases with the proportion of genes assigned to markers. For more details on this process, please refer to the score aggregation documentation.
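
The idea of giving the marker-based scores more weight as more genes match markers can be sketched with a simple weighted average in Python. This is a deliberately simplified stand-in and not geNomad’s attention mechanism: the function `aggregate_scores` is hypothetical, it assumes the marker gene fraction is used directly as the weight, and it will not reproduce the aggregated scores in the example above.

```python
def aggregate_scores(marker_scores, nn_scores, marker_gene_fraction):
    """Blend two score vectors with a weight tied to marker coverage.

    marker_gene_fraction: fraction of genes assigned to markers (0 to 1).
    ASSUMPTION: this fraction is used directly as the weight of the
    marker-based scores; geNomad uses an attention mechanism instead.
    """
    if not 0.0 <= marker_gene_fraction <= 1.0:
        raise ValueError("marker_gene_fraction must be between 0 and 1")
    weight = marker_gene_fraction
    return [
        weight * m + (1.0 - weight) * n
        for m, n in zip(marker_scores, nn_scores)
    ]
```

With a fraction of 0 the result equals the nn-classification scores, with a fraction of 1 it equals the marker-classification scores, and intermediate values fall between the two.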

score-calibration


_images/score_calibration.svg

The scores generated by marker-classification, nn-classification, and aggregated-classification indicate the confidence of these models in their predictions, with higher values reflecting greater confidence. However, these values are not equivalent to actual probabilities. For example, a sequence with an uncalibrated virus score of 0.87 does not have an 87% chance of being a virus.

score-calibration is an optional module that transforms the raw scores produced by the previous modules into estimated probabilities, so that a sequence with a calibrated virus score of 0.87 has a probability close to 87% of being a virus. If you want to understand how the score-calibration module works, refer to its documentation. To enable score calibration when using the end-to-end command, use the --enable-score-calibration parameter.
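
What it means for scores to be calibrated can be checked empirically: among sequences whose score falls in a narrow range, the fraction that are truly viruses should be close to the score itself. The Python sketch below shows such a check. The function `observed_frequency` is hypothetical, and the scores and labels are toy values invented for illustration, not geNomad output.

```python
def observed_frequency(scores, is_virus, low, high):
    """Fraction of true viruses among sequences with low <= score < high.

    Returns None when no sequence falls in the interval.
    """
    selected = [
        label for score, label in zip(scores, is_virus) if low <= score < high
    ]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Toy data, invented for illustration only (1 = virus, 0 = not a virus).
toy_scores = [0.81, 0.84, 0.86, 0.88, 0.89, 0.12, 0.15, 0.18, 0.11, 0.19]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

high_bin = observed_frequency(toy_scores, toy_labels, 0.8, 0.9)
low_bin = observed_frequency(toy_scores, toy_labels, 0.1, 0.2)
```

In this toy example, four of the five sequences scoring between 0.8 and 0.9 are viruses, and one of the five scoring between 0.1 and 0.2 is, which is the pattern expected from well-calibrated scores.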

summary

_images/summary.svg

The summary module serves three main functions: (1) filtering sequences based on various criteria to present users with the most reliable predictions (read more about the filtering process here), (2) summarizing the data generated by all previous modules for identified plasmids and viruses, and (3) writing FASTA files containing nucleotide and protein sequences for the identified plasmids and viruses, accompanied by gene annotation files. For examples of the plasmid and virus summary tables, refer to the Quickstart guide.