The geNomad pipeline

When you execute the genomad end-to-end command, geNomad runs a series of modules sequentially to produce the final output, which contains the identified plasmids and viruses in the input FASTA file.

You can also run each module yourself, one after another, which gives you access to some advanced parameters that are not available through the genomad end-to-end command:

genomad download-database .
genomad annotate metagenome.fna genomad_output genomad_db
genomad find-proviruses metagenome.fna genomad_output genomad_db
genomad marker-classification metagenome.fna genomad_output genomad_db
genomad nn-classification metagenome.fna genomad_output
genomad aggregated-classification metagenome.fna genomad_output
genomad score-calibration metagenome.fna genomad_output
genomad summary metagenome.fna genomad_output
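
The same sequence of calls can be scripted. The following is a minimal sketch in Python that only assembles the module commands listed above and, optionally, runs them in order with subprocess. The function names build_pipeline_commands and run_pipeline are hypothetical helpers, not part of geNomad, and no flags beyond those shown above are assumed.

```python
import subprocess

# Modules that take the database directory as a third argument,
# following the command list above.
MODULES_WITH_DB = ["annotate", "find-proviruses", "marker-classification"]

# Modules that only take the input FASTA and the output directory.
MODULES_WITHOUT_DB = [
    "nn-classification",
    "aggregated-classification",
    "score-calibration",
    "summary",
]


def build_pipeline_commands(fasta, output_dir, database_dir):
    """Return the module commands in execution order (hypothetical helper)."""
    commands = []
    for module in MODULES_WITH_DB:
        commands.append(["genomad", module, fasta, output_dir, database_dir])
    for module in MODULES_WITHOUT_DB:
        commands.append(["genomad", module, fasta, output_dir])
    return commands


def run_pipeline(fasta, output_dir, database_dir):
    """Run each module in order, stopping at the first failure."""
    for command in build_pipeline_commands(fasta, output_dir, database_dir):
        subprocess.run(command, check=True)
```

Calling run_pipeline("metagenome.fna", "genomad_output", "genomad_db") would issue the seven module commands shown above. It assumes that genomad is on your PATH and that the database has already been downloaded.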

For the majority of cases, using the genomad end-to-end command should be sufficient. However, it is still useful to understand what happens when the full pipeline runs. The sections below explain each module’s function, which will help you grasp how geNomad processes your input sequences to identify plasmids and viruses.

`annotate`

The annotate module has two main functions: (1) predicting genes in the input sequences with pyrodigal-gv, and (2) assigning the predicted genes, using MMseqs2, to marker protein families from a dataset of 227,897 profiles that are specific to chromosomes, plasmids, or viruses. This marker dataset carries metadata that can aid the downstream interpretation of the results. It includes:

Functional annotations via Pfam, COG, TIGRFAM, and KEGG Orthology accessions.
Hallmark genes, which are involved in key plasmid or virus functions.
Conjugation genes, through CONJscan accessions.
Antimicrobial resistance genes, via AMRFinder accessions.
Universal single-copy genes (USCGs) that are typically present in chromosomes and rare in plasmids and viruses, identified using BUSCO.
Virus taxonomy, through the use of ICTV’s VMR number 19 lineages.

The annotate module generates two primary outputs: taxonomic assignments of the input sequences (you can find an explanation of how geNomad assigns sequences to viral taxa here), and gene-level annotations (as shown in the Quickstart example). These outputs are utilized by the find-proviruses, marker-classification, and summary modules.

`find-proviruses`

The find-proviruses module is designed to identify proviral regions within host sequences. To achieve this, it uses a conditional random field (CRF) model that takes gene annotations generated by the annotate module and demarcates regions that are enriched in viral-specific markers, surrounded by host-specific markers. To refine the boundaries of proviruses, geNomad leverages the fact that phages often integrate next to tRNAs and that integrases are typically found at the edges of integrated phages. This is achieved by extending the edges until neighboring tRNAs (identified with ARAGORN) and/or integrases (identified with MMseqs2) are reached. For a detailed explanation of geNomad’s provirus identification algorithm, please refer to our provirus identification documentation.
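
The boundary refinement step can be pictured with a small sketch. It assumes a provirus given as a pair of coordinates, a list of coordinates where tRNAs or integrases were found, and a hypothetical max_extension limit on how far each edge may move. None of these names, nor the limit itself, come from geNomad; the provirus identification documentation describes the actual algorithm.

```python
def extend_boundaries(start, end, anchor_positions, max_extension):
    """Extend a provirus (start, end) outward to the nearest anchor on each side.

    anchor_positions: coordinates of tRNAs and/or integrases (assumed input).
    max_extension: hypothetical cap on how far each edge may move.
    """
    # Anchors upstream of the start that are within reach.
    left = [p for p in anchor_positions if start - max_extension <= p < start]
    # Anchors downstream of the end that are within reach.
    right = [p for p in anchor_positions if end < p <= end + max_extension]
    new_start = max(left) if left else start  # closest upstream anchor
    new_end = min(right) if right else end    # closest downstream anchor
    return new_start, new_end
```

With a provirus at (10000, 40000), anchors at 2000, 9200, 41500, and 60000, and max_extension set to 5000, only the anchors at 9200 and 41500 are within reach, so the region grows to (9200, 41500).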

`marker-classification`

The marker-classification module classifies sequences as chromosomes, plasmids, or viruses based on their marker content. To do so, it takes the gene annotations and computes a set of numerical features that describe the gene structure and marker content of each sequence to be classified. These features include gene density and the frequencies of chromosome, plasmid, and virus markers.

Below is an example of the features that are computed for five input sequences. You can learn more about how each feature is calculated by visiting our marker features documentation.

seq_name	strand_switch_rate	coding_density	no_rbs_freq	sd_canonical_rbs_freq	vp_marker_freq	vv_marker_freq	v_marker_freq	median_c_spm	median_p_spm	median_v_spm	v_vs_c_score_logistic	v_vs_p_score_logistic	p_vs_c_score_logistic
sequence_1	0.0000	0.9049	1.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.0000	0.5000	0.5000	0.5000
sequence_2	0.0000	0.7845	0.5000	0.0000	0.5000	0.0000	0.5000	0.0278	0.4961	0.8678	0.6630	0.5914	0.5762
sequence_3	1.0000	0.8704	1.0000	0.0000	0.5000	0.0000	0.5000	0.0086	0.2801	0.9599	0.6903	0.6557	0.5392
sequence_4	0.0000	0.8087	0.0000	1.0000	1.0000	0.0000	1.0000	0.0027	0.2087	0.9780	0.8398	0.8064	0.5571
sequence_5	0.0000	0.9861	0.5000	0.5000	0.0000	1.0000	1.0000	0.0043	0.0000	1.0000	0.8474	0.8479	0.4989
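
As a rough illustration of two of the structural features in the table, the sketch below computes a coding density and a strand switch rate from a list of gene coordinates. The definitions used here (the fraction of bases covered by genes, and the fraction of adjacent gene pairs that lie on opposite strands) are assumptions made for illustration, as are the function names and the input layout; the marker features documentation is the authority on how geNomad calculates each feature.

```python
def coding_density(genes, sequence_length):
    """Fraction of the sequence covered by genes (assumed definition).

    genes: list of (start, end, strand) with 1-based inclusive coordinates.
    """
    covered = set()
    for start, end, _strand in genes:
        # A set avoids double-counting bases shared by overlapping genes.
        covered.update(range(start, end + 1))
    return len(covered) / sequence_length


def strand_switch_rate(genes):
    """Fraction of adjacent gene pairs on opposite strands (assumed definition)."""
    ordered = sorted(genes, key=lambda gene: gene[0])
    if len(ordered) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a[2] != b[2])
    return switches / (len(ordered) - 1)
```

For example, two genes at (1, 400, "+") and (501, 900, "-") on a 1,000 bp sequence cover 800 bases and switch strand once across one adjacent pair.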

marker-classification then feeds these features to a tree ensemble classification algorithm, trained with XGBoost, which produces three scores for each sequence. These scores represent the model’s confidence that the sequence represents a chromosome, plasmid, or virus.

seq_name	chromosome_score	plasmid_score	virus_score
sequence_1	0.5420	0.1397	0.3183
sequence_2	0.2172	0.2148	0.5680
sequence_3	0.2937	0.1957	0.5106
sequence_4	0.0524	0.0718	0.8758
sequence_5	0.1621	0.0168	0.8211

In the example above, the model classified the first sequence as a chromosome and the remaining sequences as viruses. It is more confident that sequence_4 and sequence_5 are viruses (virus scores above 0.8) than that sequence_2 and sequence_3 are (virus scores between 0.5 and 0.6).
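
The reading of the table given above, taking the class with the highest score, can be sketched in a few lines of Python. The scores are copied from the marker-classification example; the function names are hypothetical, and picking the maximum is only how this paragraph interprets the table, not a description of geNomad's final filtering.

```python
import csv
import io

# Scores copied from the marker-classification example above (tab-separated).
SCORES_TSV = (
    "seq_name\tchromosome_score\tplasmid_score\tvirus_score\n"
    "sequence_1\t0.5420\t0.1397\t0.3183\n"
    "sequence_2\t0.2172\t0.2148\t0.5680\n"
    "sequence_3\t0.2937\t0.1957\t0.5106\n"
    "sequence_4\t0.0524\t0.0718\t0.8758\n"
    "sequence_5\t0.1621\t0.0168\t0.8211\n"
)

LABELS = {
    "chromosome_score": "chromosome",
    "plasmid_score": "plasmid",
    "virus_score": "virus",
}


def top_class(row):
    """Return (label, score) for the highest-scoring class in one row."""
    best = max(LABELS, key=lambda column: float(row[column]))
    return LABELS[best], float(row[best])


def classify(tsv_text):
    """Map each sequence name to its highest-scoring class."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["seq_name"]: top_class(row) for row in reader}
```

Calling classify(SCORES_TSV) labels sequence_1 as a chromosome and the other four sequences as viruses, matching the interpretation above.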

`nn-classification`

The nn-classification module also classifies input sequences into chromosomes, plasmids, or viruses, similar to the marker-classification module. However, unlike the latter, it doesn’t rely on marker information. Instead, it directly processes nucleotide sequences using a neural network. The nucleotide sequences are first encoded into a numerical matrix, which is then fed into an IGLOO neural network. The network is capable of detecting sequence features that distinguish chromosomes, plasmids, and viruses. Finally, the module produces confidence scores for the classifications.

seq_name	chromosome_score	plasmid_score	virus_score
sequence_1	0.3307	0.5597	0.1096
sequence_2	0.0669	0.1411	0.7920
sequence_3	0.6720	0.1340	0.1940
sequence_4	0.2923	0.2830	0.4247
sequence_5	0.0591	0.1545	0.7864

If you’re interested in learning more about how the neural network processes and classifies nucleotide sequences, check out the detailed explanation.
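
To make the encoding step concrete, here is a minimal sketch that turns a nucleotide string into a one-hot matrix with one row per base. One-hot encoding of single nucleotides is an assumption chosen for simplicity: the text above only states that sequences are encoded into a numerical matrix, and the detailed explanation describes the scheme geNomad actually uses. The function name is hypothetical.

```python
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return a len(sequence) x 4 matrix; ambiguous bases become all zeros."""
    matrix = []
    for base in sequence.upper():
        row = [0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index != -1:  # leave the row as zeros for N and other symbols
            row[index] = 1
        matrix.append(row)
    return matrix
```

A matrix like this is the kind of numerical input a neural network can consume directly, without any gene prediction or marker search.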

`aggregated-classification`

The aggregated-classification module combines the outputs of marker-classification and nn-classification to produce a set of scores that takes advantage of the strengths of both classifiers.

seq_name	chromosome_score	plasmid_score	virus_score
sequence_1	0.2169	0.5661	0.2170
sequence_2	0.0513	0.1541	0.7946
sequence_3	0.4592	0.2033	0.3375
sequence_4	0.0402	0.0446	0.9153
sequence_5	0.0233	0.0276	0.9491

To achieve this, it employs an attention mechanism that weights the contribution of each classifier, so that the contribution of marker-classification grows with the proportion of genes assigned to markers. For more details on this process, please refer to the score aggregation documentation.
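
The idea of weighting by marker coverage can be illustrated with a deliberately simplified sketch: a convex combination in which the weight given to marker-classification equals the fraction of genes assigned to markers. This linear weighting is a stand-in for illustration only and is not geNomad's attention mechanism, so it will not reproduce the aggregated scores in the table above. The function name and the marker_fraction argument are hypothetical.

```python
def aggregate_scores(marker_scores, nn_scores, marker_fraction):
    """Blend two score dicts; marker_fraction in [0, 1] weights the marker model.

    Simplified linear stand-in, not the attention mechanism used by geNomad.
    """
    if not 0.0 <= marker_fraction <= 1.0:
        raise ValueError("marker_fraction must be between 0 and 1")
    return {
        label: marker_fraction * marker_scores[label]
        + (1.0 - marker_fraction) * nn_scores[label]
        for label in marker_scores
    }
```

With marker_fraction at 0 the result equals the nn-classification scores, and with marker_fraction at 1 it equals the marker-classification scores; values in between shift the balance toward whichever classifier has more evidence to work with.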

`score-calibration`

The scores generated by marker-classification, nn-classification, and aggregated-classification indicate the confidence of these models in their predictions, with higher values reflecting greater confidence. However, these values are not equivalent to actual probabilities. For example, a sequence with an uncalibrated virus score of 0.87 does not have an 87% chance of being a virus.

score-calibration is an optional module that transforms the raw scores produced by the previous modules into estimated probabilities, so that a sequence with a calibrated virus score of 0.87 has a probability close to 87% of being a virus. If you want to understand how the score-calibration module works, refer to its documentation. To enable score calibration when using the end-to-end command, use the --enable-score-calibration parameter.
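
What "calibrated" means can be checked empirically when true labels are known: among sequences whose virus score falls in a narrow bin, the fraction that really are viruses should be close to the score. The sketch below computes that observed fraction for one bin. The function name, the bin limits, and the labeled data are hypothetical, and this is a way to check calibration, not the procedure that the score-calibration module uses.

```python
def observed_virus_fraction(scored, low, high):
    """Fraction of true viruses among sequences with low <= score < high.

    scored: list of (virus_score, is_virus) pairs with known labels (assumed).
    Returns None when no sequence falls in the bin.
    """
    in_bin = [is_virus for score, is_virus in scored if low <= score < high]
    if not in_bin:
        return None
    return sum(in_bin) / len(in_bin)
```

If the scores were well calibrated, a large labeled set evaluated on the bin from 0.85 to 0.90 would return a value near 0.87; a value far from that would indicate that the scores should not be read as probabilities.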

`summary`

The summary module serves three main functions: (1) filtering sequences based on various criteria to present users with the most reliable predictions (read more about the filtering process here), (2) summarizing the data generated by all previous modules for identified plasmids and viruses, and (3) writing FASTA files containing nucleotide and protein sequences for the identified plasmids and viruses, accompanied by gene annotation files. For examples of the plasmid and virus summary tables, refer to the Quickstart guide.
