README
======

Overview
--------
This directory contains scripts and data for searching nucleic acid sequences across various assemblies for similarities to specific reference sequences. The goal is to identify RNA viruses by examining sequence similarities and predicting open reading frames (ORFs).

Contents
--------
- all_ynp_matched.fasta: All matched contigs from the initial nucleic acid search.
- matches_partiti22.fasta: A subset of contigs matching the Cell22 suspected partiti with both "CP" and RdRp segments.
- selected_contigs_prot.gff: Predicted ORFs for the contigs in matches_partiti22.fasta.
- selected_contigs_prot.faa: Protein sequences from the predicted ORFs.
- CP_diamondp_matches.tab: Results of a DIAMOND blastp search for similarities to the suspected "CP" segment. Column headers are in the first row of the file.
- All_ynp_contigs_vs_refs_mm.tab2.gz: Gzip-compressed tabular search results. Fields are: qheader, theader, qlen, tlen, qstart, qend, tstart, tend, alnlen, mismatch, qcov, tcov, bits, evalue, gapopen, pident, nident.
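
The field list above can be attached to each row when reading the compressed table. This is a minimal sketch, assuming the file is tab-separated and that its columns appear in the order listed; the function name read_hits is illustrative and not part of the original scripts.

```python
import csv
import gzip

# Field names as listed for All_ynp_contigs_vs_refs_mm.tab2.gz.
FIELDS = (
    "qheader,theader,qlen,tlen,qstart,qend,tstart,tend,"
    "alnlen,mismatch,qcov,tcov,bits,evalue,gapopen,pident,nident"
).split(",")


def read_hits(path):
    """Yield one dict per alignment row, keyed by field name."""
    # Assumption: tab-separated text inside the gzip file.
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != len(FIELDS):
                continue  # skip blank or malformed lines
            if row[0] == FIELDS[0]:
                continue  # skip a header row, if the file has one
            yield dict(zip(FIELDS, row))
```

All values are returned as strings; convert numeric fields such as evalue or pident with float() before comparing them.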

Workflow Steps
--------------
1. Initial Search:
   - Combine contigs from different assemblies and search for similarities to sequences from Cell22, NCBI Ribovirus, the suspected "CP" segment, and Japanese hot spring sequences.
   - Output matched contigs to all_ynp_matched.fasta.
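
Combining the assemblies amounts to concatenating their FASTA files. The sketch below also reports contig IDs that occur in more than one input, since duplicated IDs make later hits ambiguous. The function name and the duplicate check are illustrative assumptions; the original scripts may combine the files differently.

```python
def combine_fasta(paths, combined_path):
    """Concatenate FASTA files in order; return IDs seen more than once."""
    seen, duplicates = set(), set()
    with open(combined_path, "w") as out:
        for path in paths:
            with open(path) as src:
                line = "\n"
                for line in src:
                    if line.startswith(">"):
                        # Assumption: the ID is the first token of the header.
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        if seq_id in seen:
                            duplicates.add(seq_id)
                        seen.add(seq_id)
                    out.write(line)
                # Keep records from fusing if a file lacks a final newline.
                if not line.endswith("\n"):
                    out.write("\n")
    return sorted(duplicates)
```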

2. Filtering for Specific Matches:
   - Extract contigs matching the Cell22 suspected partiti with both "CP" and RdRp (Ga0068666_1034815 and Ga0068641_186182).
   - Save these matches to matches_partiti22.fasta.
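
The filtering step can be sketched as below. Two assumptions are made: a contig is kept if it hits either of the two reference IDs (this README does not say whether one hit or both are required), and the query name in the hit table equals the first whitespace-delimited token of the FASTA header. The function names are illustrative only.

```python
# Reference IDs named in the filtering step.
TARGET_IDS = ("Ga0068666_1034815", "Ga0068641_186182")


def matching_contigs(hits, targets=TARGET_IDS):
    """Return query names that have a hit to any of the target IDs.

    hits is an iterable of (query_name, target_name) pairs.
    """
    # Substring match, in case target names carry extra description text.
    return {q for q, t in hits if any(tid in t for tid in targets)}


def write_subset(fasta_in, fasta_out, wanted):
    """Copy records whose ID is in wanted; return the number written."""
    keep = False
    written = 0
    with open(fasta_in) as src, open(fasta_out, "w") as out:
        for line in src:
            if line.startswith(">"):
                fields = line[1:].split()
                keep = bool(fields) and fields[0] in wanted
                if keep:
                    written += 1
            if keep:
                out.write(line)
    return written
```

For example, write_subset("all_ynp_matched.fasta", "matches_partiti22.fasta", matching_contigs(pairs)) would write the subset, where pairs holds the (qheader, theader) columns of the hit table.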

3. Predicting ORFs:
   - Predict ORFs on the filtered contigs using Pyrodigal.
   - Output results to selected_contigs_prot.gff and selected_contigs_prot.faa.
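
A minimal sketch of this step with the Pyrodigal Python API follows. It assumes metagenomic mode (meta=True) and Pyrodigal 3 or later, where the gene finder class is named GeneFinder; the settings actually used for these files are not recorded here. The helper names read_fasta and predict_orfs are illustrative.

```python
def read_fasta(path):
    """Yield (sequence_id, sequence) pairs from a FASTA file."""
    seq_id, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq_id is not None:
                    yield seq_id, "".join(chunks)
                fields = line[1:].split()
                seq_id, chunks = (fields[0] if fields else ""), []
            elif line and seq_id is not None:
                chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def predict_orfs(fasta_in, gff_out, faa_out):
    """Predict ORFs for every contig; write GFF and protein FASTA files."""
    import pyrodigal  # third-party; imported here so read_fasta works without it

    # Assumption: metagenomic mode, which needs no training step.
    finder = pyrodigal.GeneFinder(meta=True)
    with open(gff_out, "w") as gff, open(faa_out, "w") as faa:
        for i, (seq_id, seq) in enumerate(read_fasta(fasta_in)):
            genes = finder.find_genes(seq.encode("ascii"))
            # Write the GFF header only once, before the first contig.
            genes.write_gff(gff, sequence_id=seq_id, header=(i == 0))
            genes.write_translations(faa, sequence_id=seq_id)
```

A call such as predict_orfs("matches_partiti22.fasta", "selected_contigs_prot.gff", "selected_contigs_prot.faa") would produce files with the names used in this directory.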

4. Search for "CP" Segment:
   - Perform a DIAMOND blastp search of the predicted proteins against the suspected "CP" segment.
   - Save results to CP_diamondp_matches.tab.
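
The search can be sketched as two DIAMOND calls driven from Python: one to build a protein database from the "CP" sequence and one to run blastp. The input file name cp_protein.faa and the database prefix cp_db are hypothetical, and the options actually used for CP_diamondp_matches.tab are not recorded here.

```python
import subprocess


def diamond_commands(cp_faa, proteins_faa, out_tab, db_prefix="cp_db"):
    """Build the makedb and blastp command lines as argument lists."""
    makedb = ["diamond", "makedb", "--in", cp_faa, "--db", db_prefix]
    blastp = [
        "diamond", "blastp",
        "--query", proteins_faa,  # predicted proteins from the ORF step
        "--db", db_prefix,        # database built from the "CP" protein
        "--out", out_tab,
        "--outfmt", "6",          # BLAST-style tabular output
    ]
    return makedb, blastp


def run_search(cp_faa, proteins_faa, out_tab):
    """Run both commands; requires the diamond executable on PATH."""
    for cmd in diamond_commands(cp_faa, proteins_faa, out_tab):
        subprocess.run(cmd, check=True)
```

For example, run_search("cp_protein.faa", "selected_contigs_prot.faa", "CP_diamondp_matches.tab"). Plain format 6 writes no header line, so this sketch does not reproduce the header row described for CP_diamondp_matches.tab.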