The geNomad pipeline
When you execute the genomad end-to-end command, geNomad runs a series of modules sequentially to produce the final output, which contains the plasmids and viruses identified in the input FASTA file.
You can also execute the modules one at a time, which gives you access to some advanced parameters that are not available through the genomad end-to-end command:
genomad download-database .
genomad annotate metagenome.fna genomad_output genomad_db
genomad find-proviruses metagenome.fna genomad_output genomad_db
genomad marker-classification metagenome.fna genomad_output genomad_db
genomad nn-classification metagenome.fna genomad_output
genomad aggregated-classification metagenome.fna genomad_output
genomad score-calibration metagenome.fna genomad_output
genomad summary metagenome.fna genomad_output
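For scripted runs, the same sequence of modules can be assembled programmatically. The Python sketch below only rebuilds the command lines listed above and adds no extra flags. The helper name `build_genomad_commands` is hypothetical, and the sketch assumes geNomad is installed and the database has already been downloaded.

```python
# Modules in pipeline order; True means the module also takes the database path.
MODULES = [
    ("annotate", True),
    ("find-proviruses", True),
    ("marker-classification", True),
    ("nn-classification", False),
    ("aggregated-classification", False),
    ("score-calibration", False),
    ("summary", False),
]


def build_genomad_commands(fasta, output_dir, database_dir):
    """Return one argument list per module, in the order shown above."""
    commands = []
    for module, needs_database in MODULES:
        command = ["genomad", module, fasta, output_dir]
        if needs_database:
            command.append(database_dir)
        commands.append(command)
    return commands


if __name__ == "__main__":
    for command in build_genomad_commands("metagenome.fna", "genomad_output", "genomad_db"):
        # Print each command; pass it to subprocess.run(command, check=True) to execute it.
        print(" ".join(command))
```

Keeping each command as a list of arguments avoids shell quoting problems if the input paths contain spaces.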
For the majority of cases, using the genomad end-to-end command should be sufficient. However, it is still useful to understand what happens when the full pipeline runs. The sections below explain what each module does and how geNomad processes your input sequences to identify plasmids and viruses.
annotate
The annotate module has two main functions: predicting genes in the input sequences using pyrodigal-gv, and assigning the predicted genes to marker protein families using MMseqs2. The markers come from a dataset of 227,897 profiles specific to chromosomes, plasmids, or viruses. This marker dataset provides comprehensive metadata that can aid in the downstream interpretation of the results. It includes:
Functional annotations via Pfam, COG, TIGRFAM, and KEGG Orthology accessions.
Hallmark genes, which are involved in key plasmid or virus functions.
Conjugation genes, through CONJscan accessions.
Antimicrobial resistance genes, via AMRFinder accessions.
Universal single-copy genes (USCGs) that are typically present in chromosomes and rare in plasmids and viruses, identified using BUSCO.
Virus taxonomy, through the use of ICTV’s VMR number 19 lineages.
The annotate module generates two primary outputs: taxonomic assignments of the input sequences (you can find an explanation of how geNomad assigns sequences to viral taxa here), and gene-level annotations (as shown in the Quickstart example). These outputs are used by the find-proviruses, marker-classification, and summary modules.
find-proviruses
The find-proviruses module identifies proviral regions within host sequences. To achieve this, it uses a conditional random field (CRF) model that takes the gene annotations generated by the annotate module and demarcates regions that are enriched in virus-specific markers and surrounded by host-specific markers. To refine the boundaries of proviruses, geNomad leverages the fact that phages often integrate next to tRNAs and that integrases are typically found at the edges of integrated phages. It does so by extending the edges until neighboring tRNAs (identified with ARAGORN) and/or integrases (identified with MMseqs2) are reached. For a detailed explanation of geNomad’s provirus identification algorithm, please refer to our provirus identification documentation.
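The boundary refinement step can be pictured with a small Python sketch. It is a simplified illustration, not geNomad's implementation: it works on gene indices rather than coordinates, the function name `extend_to_anchor` is hypothetical, and the `max_genes` search limit is an assumption made for the example.

```python
def extend_to_anchor(start, end, anchors, max_genes=5):
    """Extend a provirus gene-index interval [start, end] to nearby anchor genes.

    `anchors` holds the gene indices of tRNAs or integrases. `max_genes` is a
    hypothetical limit on how far an edge may move; it is not a geNomad setting.
    """
    # Anchors upstream of the region, within the allowed distance.
    left = [a for a in anchors if start - max_genes <= a < start]
    # Anchors downstream of the region, within the allowed distance.
    right = [a for a in anchors if end < a <= end + max_genes]
    # Move each edge to the nearest anchor on its side, if there is one.
    new_start = max(left) if left else start
    new_end = min(right) if right else end
    return new_start, new_end


if __name__ == "__main__":
    # Region spans genes 10..20; anchors at genes 8 and 22 are within reach.
    print(extend_to_anchor(10, 20, anchors=[3, 8, 22, 30]))
```

An edge with no anchor within reach is left where the CRF placed it.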
marker-classification
The marker-classification module classifies sequences as chromosomes, plasmids, or viruses based on their marker content. To do so, it takes the gene annotations and calculates a set of numerical features that describe the gene structure and marker content of each sequence to be classified. These features include gene density and the frequency of chromosome, plasmid, and virus markers.
Below is an example of the features that are computed for five input sequences. You can learn more about how each feature is calculated by visiting our marker features documentation.
seq_name | strand_switch_rate | coding_density | no_rbs_freq | sd_bacteroidetes_rbs_freq | sd_canonical_rbs_freq | tatata_rbs_freq | cc_marker_freq | cp_marker_freq | cv_marker_freq | pc_marker_freq | pp_marker_freq | pv_marker_freq | vc_marker_freq | vp_marker_freq | vv_marker_freq | c_marker_freq | p_marker_freq | v_marker_freq | median_c_spm | median_p_spm | median_v_spm | v_vs_c_score_logistic | v_vs_p_score_logistic | p_vs_c_score_logistic | gv_marker_freq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sequence_1 | 0.0000 | 0.9049 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.5000 | 0.5000 | 0.0000 |
sequence_2 | 0.0000 | 0.7845 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0278 | 0.4961 | 0.8678 | 0.6630 | 0.5914 | 0.5762 | 0.0000 |
sequence_3 | 1.0000 | 0.8704 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0086 | 0.2801 | 0.9599 | 0.6903 | 0.6557 | 0.5392 | 0.0000 |
sequence_4 | 0.0000 | 0.8087 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0027 | 0.2087 | 0.9780 | 0.8398 | 0.8064 | 0.5571 | 0.0000 |
sequence_5 | 0.0000 | 0.9861 | 0.5000 | 0.0000 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 1.0000 | 0.0043 | 0.0000 | 1.0000 | 0.8474 | 0.8479 | 0.4989 | 0.0000 |
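To make a few of these features concrete, the Python sketch below computes values analogous to coding_density, strand_switch_rate, and a marker frequency from a list of gene coordinates. The definitions used here are assumptions chosen for illustration (covered bases over sequence length, strand changes between adjacent genes, fraction of genes with a given marker label). The exact formulas geNomad uses are described in the marker features documentation and may differ.

```python
def coding_density(genes, seq_length):
    """Fraction of the sequence covered by at least one gene.

    `genes` is a list of (start, end, strand) tuples with 1-based,
    inclusive coordinates.
    """
    covered = set()
    for start, end, _strand in genes:
        covered.update(range(start, end + 1))
    return len(covered) / seq_length


def strand_switch_rate(genes):
    """Fraction of adjacent gene pairs that lie on opposite strands."""
    ordered = sorted(genes)  # sort by start coordinate
    if len(ordered) < 2:
        return 0.0
    switches = sum(1 for a, b in zip(ordered, ordered[1:]) if a[2] != b[2])
    return switches / (len(ordered) - 1)


def marker_freq(assignments, label):
    """Fraction of genes whose marker label equals `label` (None = no marker)."""
    if not assignments:
        return 0.0
    return sum(1 for a in assignments if a == label) / len(assignments)


if __name__ == "__main__":
    # Hypothetical 1,000 bp sequence with three genes.
    genes = [(1, 300, 1), (350, 650, -1), (700, 900, -1)]
    print(coding_density(genes, 1000))
    print(strand_switch_rate(genes))
    print(marker_freq(["v", None, "v"], "v"))
```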
marker-classification then feeds these features to a tree ensemble classification algorithm, trained with XGBoost, which produces three scores for each sequence. These scores represent the model’s confidence that the sequence represents a chromosome, plasmid, or virus.
seq_name | chromosome_score | plasmid_score | virus_score |
---|---|---|---|
sequence_1 | 0.5420 | 0.1397 | 0.3183 |
sequence_2 | 0.2172 | 0.2148 | 0.5680 |
sequence_3 | 0.2937 | 0.1957 | 0.5106 |
sequence_4 | 0.0524 | 0.0718 | 0.8758 |
sequence_5 | 0.1621 | 0.0168 | 0.8211 |
In the example above, the model classified the first sequence as a chromosome and the remaining sequences as viruses. The model is more confident that sequence_4 and sequence_5 are viruses (virus scores above 0.8) than that sequence_2 and sequence_3 are (virus scores between 0.5 and 0.6).
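The reading of the table described above amounts to picking the class with the highest score for each sequence. A minimal Python sketch, using the marker-classification scores from the example (the `classify` helper is a hypothetical name, and geNomad applies additional filtering in its summary module):

```python
# Scores copied from the marker-classification example table.
scores = {
    "sequence_1": {"chromosome": 0.5420, "plasmid": 0.1397, "virus": 0.3183},
    "sequence_2": {"chromosome": 0.2172, "plasmid": 0.2148, "virus": 0.5680},
    "sequence_3": {"chromosome": 0.2937, "plasmid": 0.1957, "virus": 0.5106},
    "sequence_4": {"chromosome": 0.0524, "plasmid": 0.0718, "virus": 0.8758},
    "sequence_5": {"chromosome": 0.1621, "plasmid": 0.0168, "virus": 0.8211},
}


def classify(score_row):
    """Return the highest-scoring class and its score."""
    label = max(score_row, key=score_row.get)
    return label, score_row[label]


if __name__ == "__main__":
    for name, row in scores.items():
        label, score = classify(row)
        print(f"{name}\t{label}\t{score:.4f}")
```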
nn-classification
The nn-classification module also classifies input sequences into chromosomes, plasmids, or viruses, similar to the marker-classification module. However, unlike the latter, it does not rely on marker information. Instead, it processes the nucleotide sequences directly using a neural network. The nucleotide sequences are first encoded into a numerical matrix, which is then fed into an IGLOO neural network. The network is capable of detecting sequence features that distinguish chromosomes, plasmids, and viruses. Finally, the module produces confidence scores for the classifications.
seq_name | chromosome_score | plasmid_score | virus_score |
---|---|---|---|
sequence_1 | 0.3307 | 0.5597 | 0.1096 |
sequence_2 | 0.0669 | 0.1411 | 0.7920 |
sequence_3 | 0.6720 | 0.1340 | 0.1940 |
sequence_4 | 0.2923 | 0.2830 | 0.4247 |
sequence_5 | 0.0591 | 0.1545 | 0.7864 |
If you’re interested in learning more about how the neural network processes and classifies nucleotide sequences, check out the detailed explanation.
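As a rough illustration of what "encoding nucleotide sequences into a numerical matrix" can look like, the Python sketch below uses one-hot encoding, a common generic choice. This is an assumption made for the example: the encoding geNomad actually uses is described in the detailed explanation and may differ.

```python
ALPHABET = "ACGT"


def one_hot_encode(sequence):
    """Return a len(sequence) x 4 matrix with one row per nucleotide.

    Columns follow the order A, C, G, T. Bases outside the alphabet
    (for example N) become rows of zeros.
    """
    matrix = []
    for base in sequence.upper():
        row = [0, 0, 0, 0]
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in one_hot_encode("ACgN"):
        print(row)
```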
aggregated-classification
The aggregated-classification module combines the outputs of marker-classification and nn-classification to produce a set of scores that takes advantage of the strengths of both classifiers.
seq_name | chromosome_score | plasmid_score | virus_score |
---|---|---|---|
sequence_1 | 0.2169 | 0.5661 | 0.2170 |
sequence_2 | 0.0513 | 0.1541 | 0.7946 |
sequence_3 | 0.4592 | 0.2033 | 0.3375 |
sequence_4 | 0.0402 | 0.0446 | 0.9153 |
sequence_5 | 0.0233 | 0.0276 | 0.9491 |
To achieve this, it employs an attention mechanism that weights the contribution of each classifier, so that the contribution of marker-classification increases with the proportion of genes assigned to markers. For more details on this process, please refer to the score aggregation documentation.
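The idea that the marker-based classifier should count for more when more genes carry markers can be illustrated with a simple linear blend in Python. This is a deliberately simplified stand-in, not geNomad's attention mechanism: the function name `blend_scores` is hypothetical, using the marker fraction directly as the weight is an assumption, and the sketch is not expected to reproduce the aggregated scores in the table above.

```python
def blend_scores(marker_scores, nn_scores, marker_fraction):
    """Blend two score dictionaries; the marker model's weight is marker_fraction.

    This linear weighting is an illustrative assumption, not geNomad's
    learned attention mechanism.
    """
    if not 0.0 <= marker_fraction <= 1.0:
        raise ValueError("marker_fraction must be between 0 and 1")
    blended = {
        label: marker_fraction * marker_scores[label]
        + (1.0 - marker_fraction) * nn_scores[label]
        for label in marker_scores
    }
    # Renormalize so the three scores sum to 1.
    total = sum(blended.values())
    return {label: value / total for label, value in blended.items()}


if __name__ == "__main__":
    # sequence_4 scores from the marker-classification and nn-classification examples.
    marker = {"chromosome": 0.0524, "plasmid": 0.0718, "virus": 0.8758}
    nn = {"chromosome": 0.2923, "plasmid": 0.2830, "virus": 0.4247}
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, blend_scores(marker, nn, fraction))
```

With a marker fraction of 0 the result follows the neural network alone, and with a fraction of 1 it follows the marker-based classifier alone.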
score-calibration
The scores generated by marker-classification, nn-classification, and aggregated-classification indicate the confidence of these models in their predictions, with higher values reflecting greater confidence. However, these values are not equivalent to actual probabilities. For example, a sequence with an uncalibrated virus score of 0.87 does not necessarily have an 87% chance of being a virus.
score-calibration is an optional module that transforms the raw scores produced by the previous modules into estimated probabilities, so that a sequence with a calibrated virus score of 0.87 has a probability close to 87% of being a virus. If you want to understand how the score-calibration module works, refer to its documentation. To enable score calibration when using the end-to-end command, use the --enable-score-calibration parameter.
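What "calibrated" means can be checked empirically on sequences whose true class is known: within a group of sequences with similar scores, the fraction that are truly viruses should be close to the average score. The Python sketch below builds such a reliability table. It is a generic diagnostic rather than geNomad's calibration method, the function name `reliability_table` is hypothetical, and the toy data are invented for the example.

```python
def reliability_table(scores, is_virus, n_bins=5):
    """Compare mean score with observed virus fraction in equal-width score bins.

    Returns a list of (mean_score, observed_fraction, count) tuples,
    one per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for score, truth in zip(scores, is_virus):
        index = min(int(score * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[index].append((score, truth))
    table = []
    for members in bins:
        if not members:
            continue
        mean_score = sum(s for s, _ in members) / len(members)
        observed = sum(1 for _, t in members if t) / len(members)
        table.append((mean_score, observed, len(members)))
    return table


if __name__ == "__main__":
    # Toy data: four sequences scored 0.9 (three are truly viruses), one scored 0.1.
    toy_scores = [0.9, 0.9, 0.9, 0.9, 0.1]
    toy_truth = [True, True, True, False, False]
    for mean_score, observed, count in reliability_table(toy_scores, toy_truth):
        print(f"mean score {mean_score:.2f}  observed {observed:.2f}  n={count}")
```

For well-calibrated scores, the two columns stay close to each other in every bin.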
summary
The summary module serves three main functions: (1) filtering sequences based on various criteria to present users with the most reliable predictions (read more about the filtering process here), (2) summarizing the data generated by all previous modules for the identified plasmids and viruses, and (3) writing FASTA files containing nucleotide and protein sequences for the identified plasmids and viruses, accompanied by gene annotation files. For examples of the plasmid and virus summary tables, refer to the Quickstart guide.