## Overview:
These files accompany the construction of SpacerDB, a database of CRISPR spacers extracted from short-read metagenomes. The spacers were obtained with SpacerExtractor (https://code.jgi.doe.gov/SRoux/spacerextractor), and the database itself is described in more detail at https://spacers.jgi.doe.gov.


## Files available for download:

Accompanying data for the manuscript "Planetary-scale metagenomic search reveals new patterns of CRISPR targeting" (https://doi.org/10.1101/2025.06.12.659409):
- Supplementary_Data_File_1_repeats.tsv: Table of all CRISPR repeats used to identify potential CRISPR spacers
- Supplementary_Data_File_2_samples.tsv: Table of all samples mined for CRISPR spacers
- Spacerdb_raw_files.tar.gz: Tar archive of all tsv files containing the original input data. These files can be used with the scripts provided at https://github.com/simroux/globalspacers_scripts to retrace the steps of the original analysis. New analyses should instead use the databases provided at https://spacers.jgi.doe.gov

Example files used in the notebooks available at https://spacers.jgi.doe.gov/notebooks/:
- Example_viruses.fna: fasta file of the virus genomes used in the "SpacerDB Hit Analysis Examples" notebook
- nr_spacers_hq-taxoselected_25-05-10_vs_Example_viruses_db_all_hits.tsv: spacer hits table used in the "SpacerDB Hit Analysis Examples" notebook

Fasta files and database export of all high-quality spacers associated with a taxonomically assigned repeat (THESE ARE LIKELY THE FILES YOU NEED IF YOU WANT TO CONNECT MGE TO POTENTIAL HOSTS):
- nr_spacers_hq-taxoselected_25-05-10.fna.gz: fasta file of high-quality spacers for which a repeat taxonomic assignment is available, non-redundant and gzipped, generated on May 10, 2025. These spacers are meant for host prediction analyses and were extracted from global_crispr_db_spacertaxa_2025-05-02.duckdb. Sequence identifiers correspond to spacer cluster ids (see https://spacers.jgi.doe.gov).
- global_crispr_db_spacertaxa_2025-05-02-tsv-bundle.tar.gz: tsv files providing information about the spacers included in the above fasta file. Specifically, this bundle includes:
	- All_spacers_info_filtered_clusters-Jul19-24.tsv: Provides the link between spacer clusters and individual spacers (i.e. which spacers are members of which clusters).
	- All_spacers_info_filtered-Jul19-24.tsv: Information about individual spacers, including original sample, repeat, and sequence. Note that the repeat information is also encoded in the spacer identifier itself (second field, starting with "Ac_").
	- Repeat_info_filtered_for_db-Nov1-24.tsv: Information about individual repeats, including taxonomic assignment and predicted CRISPR type.
	- Runs_to_ecosystem_and_sequencing_and_study_for_db-Jul28-24.tsv: Information about individual samples, including ecosystem classification.
- global_crispr_db_spacertaxa_2025-05-02.duckdb: the database version of the above tsv bundle, available from https://spacers.jgi.doe.gov/s3-gateway/global_crispr_db_spacertaxa_2025-05-02.duckdb . See more information at https://spacers.jgi.doe.gov/quick_start/#database-versions . Only one of the two (the tsv bundle or the duckdb database) is needed, since both include the same information; they are provided in two formats for convenience.
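
To check what the tsv bundle contains before loading it in full, the following is a minimal Python sketch using only the standard library. It lists the column names of each tsv file in a tar.gz archive without extracting anything to disk. The function name `peek_tsv_headers` is hypothetical, and the sketch assumes that the first line of each tsv file is a header row, which should be verified against the actual files.

```python
import csv
import io
import tarfile


def peek_tsv_headers(bundle_path):
    """Return {member name: list of header columns} for each .tsv in a tar.gz bundle.

    Assumption: the first line of every tsv file is a header row.
    """
    headers = {}
    with tarfile.open(bundle_path, mode="r:gz") as bundle:
        for member in bundle:
            # Skip directories and anything that is not a tsv file.
            if not (member.isfile() and member.name.endswith(".tsv")):
                continue
            raw = bundle.extractfile(member)
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            reader = csv.reader(text, delimiter="\t")
            # Read only the first row; the rest of the file is not parsed.
            headers[member.name] = next(reader, [])
    return headers


# Example call, once the bundle has been downloaded:
# for name, columns in peek_tsv_headers(
#     "global_crispr_db_spacertaxa_2025-05-02-tsv-bundle.tar.gz"
# ).items():
#     print(name, columns)
```

The same function should work on Spacerdb_raw_files.tar.gz, since it is also described as a tar archive of tsv files.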


Fasta file export of all high-quality spacers:
- nr_spacers_hq-all_25-05-10.fna.gz: fasta file of high-quality spacers, non-redundant and gzipped, generated on May 10, 2025. These spacers were extracted from global_crispr_db_full_2025-05-02.duckdb, and sequence identifiers correspond to spacer cluster ids (see https://spacers.jgi.doe.gov).
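
The gzipped fasta files listed above can be read directly, without decompressing them first. The following is a minimal Python sketch using only the standard library. The function name `read_fasta_gz` is hypothetical, and the sketch assumes a standard fasta layout (a `>` header line followed by one or more sequence lines). Dedicated parsers such as Biopython would work equally well.

```python
import gzip


def read_fasta_gz(path):
    """Yield (identifier, sequence) tuples from a gzipped fasta file."""
    identifier, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if identifier is not None:
                    yield identifier, "".join(chunks)
                # Keep only the first whitespace-delimited token of the header.
                tokens = line[1:].split()
                identifier, chunks = (tokens[0] if tokens else ""), []
            else:
                chunks.append(line)
    # Emit the last record of the file.
    if identifier is not None:
        yield identifier, "".join(chunks)


# Example call, once the file has been downloaded:
# for cluster_id, spacer_seq in read_fasta_gz("nr_spacers_hq-all_25-05-10.fna.gz"):
#     print(cluster_id, len(spacer_seq))
```

Because the function is a generator, records are produced one at a time, so the whole file does not need to fit in memory.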