Score calibration

Let’s say you’re using geNomad to classify the sequences in a given sample in order to identify viruses. One of the sequences received the following scores:

| Chrom. score | Plasm. score | Virus score |
|---|---|---|
| 0.0940 | 0.2602 | 0.6458 |

Because its virus score is the highest of the three, the sequence is most likely viral. But how likely?

To answer this, you would have to know the underlying composition of your sample, that is, the proportions of chromosomal, plasmid, and viral sequences in it. To illustrate why the sample composition affects the interpretation of the scores, imagine two contrasting situations: a metagenome in which most sequences are of cellular origin, and a virome composed mostly of viral sequences.

In the metagenome, a higher proportion of the sequences identified as viruses would be misclassified chromosomal sequences than in the virome, simply because there are more sequences of cellular origin that are susceptible to misclassification. Conversely, the virome would have a lower proportion of misclassified sequences among the identified viruses:

_images/calibration.svg

This means that, even though your sequence would receive the same scores across different executions, the actual probability of it being a virus depends on the context.
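The arithmetic behind this can be sketched in a few lines of Python. Every number here is a hypothetical assumption chosen for illustration (the sample sizes, the 0.95 true positive rate, and the 0.01 false positive rate are not geNomad measurements), and `virus_call_precision` is an illustrative helper, not part of geNomad.

```python
def virus_call_precision(n_viral, n_cellular, tpr, fpr):
    """Fraction of the sequences called viral that really are viral."""
    true_positives = n_viral * tpr        # viral sequences correctly called viral
    false_positives = n_cellular * fpr    # cellular sequences wrongly called viral
    return true_positives / (true_positives + false_positives)


# Hypothetical classifier behaviour, identical in both samples.
TPR = 0.95  # assumed fraction of viral sequences that are detected
FPR = 0.01  # assumed fraction of cellular sequences misclassified as viral

# Two hypothetical samples of 10,000 sequences each.
metagenome_precision = virus_call_precision(n_viral=200, n_cellular=9_800, tpr=TPR, fpr=FPR)
virome_precision = virus_call_precision(n_viral=9_000, n_cellular=1_000, tpr=TPR, fpr=FPR)

print(f"metagenome: {metagenome_precision:.3f}")
print(f"virome:     {virome_precision:.3f}")
```

Under these assumptions, roughly two thirds of the virus calls in the metagenome are correct, against almost all of them in the virome, even though the classifier itself behaves identically in both.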

Unfortunately, it’s impossible to know the composition of a sample beforehand. geNomad can, however, compute quite accurate composition estimates from its own classification results. Because of that, you can use geNomad to calibrate the raw scores and obtain good estimates of the probability that your sequence of interest is actually a virus. In the two hypothetical situations presented above, you would get different probabilities:

| Composition | Chrom. score | Plasm. score | Virus score |
|---|---|---|---|
| No calibration | 0.0940 | 0.2602 | 0.6458 |
| Metagenome | 0.4438 | 0.0033 | 0.5529 |
| Virome | 0.0257 | 0.0004 | 0.9739 |
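To get an intuition for how a composition estimate can move the scores, here is a minimal sketch of a generic prior-shift correction based on Bayes' rule. It is not geNomad's calibration procedure and is not expected to reproduce the values above. The function `shift_priors`, the balanced training priors, and both sample compositions are assumptions made for illustration only.

```python
def shift_priors(scores, train_priors, sample_priors):
    """Re-weight class scores from the training priors to the sample priors."""
    weighted = [
        score * new / old
        for score, old, new in zip(scores, train_priors, sample_priors)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]  # renormalise so the classes sum to 1


CLASSES = ("chromosome", "plasmid", "virus")
raw_scores = (0.0940, 0.2602, 0.6458)  # raw scores of the example sequence

# Assumption: the raw scores behave like posteriors under balanced classes.
train_priors = (1 / 3, 1 / 3, 1 / 3)

# Hypothetical sample compositions (chromosome, plasmid, virus).
metagenome_priors = (0.85, 0.05, 0.10)
virome_priors = (0.10, 0.05, 0.85)

metagenome_probs = shift_priors(raw_scores, train_priors, metagenome_priors)
virome_probs = shift_priors(raw_scores, train_priors, virome_priors)

for name, probs in (("metagenome", metagenome_probs), ("virome", virome_probs)):
    best = CLASSES[probs.index(max(probs))]
    print(name, [round(p, 4) for p in probs], "->", best)
```

With these assumed compositions, the same raw scores lead to a chromosome call in the metagenome and a confident virus call in the virome, which is the kind of context dependence that calibration accounts for.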

geNomad’s score calibration provides two main benefits:

  1. Allows more principled decisions by providing estimated false discovery rates.

  2. Improves the classification performance by changing the classification of some sequences after updating the scores with knowledge of the sample composition.
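As an illustration of the first benefit, one common way to estimate a false discovery rate from calibrated probabilities is to average `1 - p` over the selected sequences. The sketch below assumes the probabilities are well calibrated. The helper `estimated_fdr` and the list of probabilities are hypothetical, and this is not necessarily how geNomad computes its own estimate.

```python
def estimated_fdr(virus_probs, threshold):
    """Expected fraction of false positives among sequences at or above the threshold."""
    selected = [p for p in virus_probs if p >= threshold]
    if not selected:
        return 0.0  # nothing selected, so there are no discoveries to be false
    # Each selected sequence is a false positive with probability 1 - p.
    return sum(1.0 - p for p in selected) / len(selected)


# Hypothetical calibrated virus probabilities for six sequences.
calibrated_virus_probs = [0.99, 0.97, 0.90, 0.75, 0.55, 0.30]

for threshold in (0.9, 0.7, 0.5):
    fdr = estimated_fdr(calibrated_virus_probs, threshold)
    print(f"threshold {threshold}: estimated FDR {fdr:.4f}")
```

Lowering the threshold admits more sequences at the cost of a higher estimated FDR, so the threshold can be chosen to match the error rate you are willing to accept.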