## Process:

- Spacers were extracted from metagenome reads, based on known repeats.
- 410,791,933 non-redundant spacers were mapped to the 48,949 microvirus genomes with bowtie1, which performs global alignment of spacers with up to 3 mismatches.
- Non-redundant spacers are connected to repeat(s) and sample(s) (based on the reads they were detected in).

## Data are summarized in 3 files:

- `nr_spacers_hq_vs_Microv50k_db_all_hits_hits_info.tsv`: for each virus-repeat pair, includes the number of distinct spacer hits ("n_hits"), then the number of hits with 0, 1, 2, or 3 mismatches between spacer and genome.
- `nr_spacers_hq_vs_Microv50k_db_all_hits_repeat_info.tsv`: for each repeat with at least one spacer matching a microvirus with 0 or 1 mismatch, includes the information we have about the repeat sequence itself (predicted CRISPR type, and LCA taxonomic assignment if available).
- `nr_spacers_hq_vs_Microv50k_db_all_hits_sample_info.tsv`: for each virus-repeat pair with at least one hit at 0 or 1 mismatch, lists the different ecosystem(s) and the number of libraries from which these spacers were obtained.

## Things we (think we) have learned from this type of data:

- We typically ignore cases with only hits at 2 or 3 mismatches. These may be real, but there starts to be quite a lot of noise at this point.
- With that many spacers, you can also typically focus on cases with multiple (5+, 10+, or even more) spacer hits at 0 or 1 mismatch for a given virus-repeat pair.
- Each virus may be targeted by multiple repeats, but usually they all point to the same host taxon, so we think these are reliable.
- We keep the hits at 2 and 3 mismatches because we think there is a signal there in terms of escape mutations, but we are still working on the best way to analyze these.
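
The filtering rule of thumb described above (keep virus-repeat pairs with several spacer hits at 0 or 1 mismatch, and set aside pairs supported only by hits at 2 or 3 mismatches) could be sketched as follows. This is only an illustration under stated assumptions: apart from `n_hits`, the column names (`virus`, `repeat`, `n_hits_0mm`, `n_hits_1mm`, `n_hits_2mm`, `n_hits_3mm`) are hypothetical and should be replaced with the actual headers of the hits_info file, and the toy rows are made up for the example.

```python
import csv
import io

# ASSUMPTION: only "n_hits" is named in this README; the per-mismatch column
# names below are hypothetical placeholders for the real file headers.
STRICT_COLUMNS = ("n_hits_0mm", "n_hits_1mm")


def read_hits(handle):
    """Read a tab-separated hits table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def filter_reliable_pairs(rows, min_hits=5):
    """Keep virus-repeat pairs with at least `min_hits` hits at 0 or 1 mismatch.

    Pairs supported only by hits at 2 or 3 mismatches have zero strict hits
    and are therefore dropped for any min_hits >= 1.
    """
    kept = []
    for row in rows:
        strict_hits = sum(int(row[col]) for col in STRICT_COLUMNS)
        if strict_hits >= min_hits:
            # Copy the row and record the combined 0/1-mismatch count.
            kept.append({**row, "n_hits_0_1mm": strict_hits})
    return kept


# Made-up toy table standing in for the real hits_info file.
EXAMPLE_TSV = (
    "virus\trepeat\tn_hits\tn_hits_0mm\tn_hits_1mm\tn_hits_2mm\tn_hits_3mm\n"
    "virus_A\trepeat_1\t12\t6\t2\t3\t1\n"
    "virus_A\trepeat_2\t4\t0\t0\t2\t2\n"
    "virus_B\trepeat_3\t7\t3\t1\t2\t1\n"
)

rows = read_hits(io.StringIO(EXAMPLE_TSV))
reliable = filter_reliable_pairs(rows, min_hits=5)
for row in reliable:
    print(row["virus"], row["repeat"], row["n_hits_0_1mm"])
# prints: virus_A repeat_1 8
```

To run this on the real table, pass an open file handle for the hits_info file to `read_hits` instead of the `io.StringIO` toy input. Raising `min_hits` to 10 or more gives the stricter cutoffs mentioned above.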