Question

How to download multiple fasta sequences and keep them organized by what genome they came from

13 months ago

ATS ▴ 10

Hi all,

I am trying to do protein co-evolution analysis for two proteins across many related bacterial species/strains. For this analysis, I'd like to download the sequences for protein 1 and protein 2 and then concatenate them together for each bacterial species and strain, like this:

Species1_Str1  Protein1Protein2
Species1_Str2  Protein1Protein2
Species2_Str1  Protein1Protein2
..

So far I've been using protein BLAST to download fasta files for all the homologs of my protein of interest. However, I'm having trouble pairing protein1 and protein2 sequences from the same bacterial strains. It seems like the fasta output stores the protein accession number, species name, and taxID, but not the specific strain information for each.

Is there a way to download two proteins from one genome, concatenate them, and then move on to the next genome? Or any other way to keep sequences from the same genome together after downloading them?

Thanks in advance for any advice! I am very new to bioinformatics.

Blast Sequences • 1.4k views

score 0 · Answer 1 · 2024-06-04

Surely you are not downloading FASTA files with BLAST itself, since BLAST is a local alignment search tool and not a tool for downloading sequences. You're probably using BLAST to find the homologous sequences and then downloading them some other way (through the web interface?). When downloading a group of sequences, one would usually use SRA or another way of accessing NCBI services (their website explains this, for a start), create a list of accessions, and then download the sequences with a script.
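
As a rough sketch of the "create a list and then download the sequences with a script" step: the code below assumes your list holds NCBI protein accessions and that you fetch them through the E-utilities `efetch` endpoint of the Entrez protein database. That choice is an assumption on my part, not something stated above, and the function names `efetch_url` and `download_fasta` are made up for illustration.

```python
import time
import urllib.parse
import urllib.request

# NCBI E-utilities efetch endpoint (assumed access route, see note above).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accessions, db="protein"):
    """Build an efetch URL that returns FASTA text for a batch of accessions."""
    params = {
        "db": db,
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH + "?" + urllib.parse.urlencode(params)


def download_fasta(accessions, out_path, batch_size=100, pause=0.4):
    """Fetch accessions in batches and write the returned FASTA text to out_path."""
    with open(out_path, "w") as out:
        for start in range(0, len(accessions), batch_size):
            batch = accessions[start:start + batch_size]
            with urllib.request.urlopen(efetch_url(batch)) as response:
                out.write(response.read().decode("utf-8"))
            # Pause between requests so the script stays under NCBI's rate limit.
            time.sleep(pause)
```

You would call `download_fasta(my_accessions, "protein1.fasta")` with the accession list saved from your BLAST hits. Note that this only fetches the sequences; it does not by itself tell you which strain each one came from.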

For a well-organized project, may I suggest you prepare separate directories and perform the search in steps:

1. Use BLAST to locate the proteins you want to analyze.
2. Save the information that is required for SRA access, such as GenBank or UniProtKB/Swiss-Prot information, accession, and project name.
3. Create a folder for each bacterial species and strain.
4. Use the SRA/GenBank information to download all your sequences, in an organized way, into each folder.
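
Once the folders exist, the pairing and concatenation could look something like the sketch below. It assumes a layout of one folder per strain (e.g. `genomes/Species1_Str1/`), each holding `protein1.fasta` and `protein2.fasta`. Those file names, the folder layout, and the function names are hypothetical, so adjust them to whatever you actually use.

```python
from pathlib import Path


def read_first_sequence(fasta_path):
    """Return the first sequence in a FASTA file as a single string."""
    parts = []
    seen_header = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    break  # a second record starts here; stop reading
                seen_header = True
            elif line and seen_header:
                parts.append(line)
    return "".join(parts)


def concatenate_by_strain(root, names=("protein1.fasta", "protein2.fasta")):
    """Map each strain folder name to its concatenated protein sequences."""
    combined = {}
    for strain_dir in sorted(Path(root).iterdir()):
        if not strain_dir.is_dir():
            continue
        paths = [strain_dir / name for name in names]
        if not all(path.is_file() for path in paths):
            continue  # skip strains that are missing either protein
        combined[strain_dir.name] = "".join(read_first_sequence(p) for p in paths)
    return combined


def write_fasta(combined, out_path):
    """Write one FASTA record per strain, using the folder name as the header."""
    with open(out_path, "w") as out:
        for strain, sequence in combined.items():
            out.write(f">{strain}\n{sequence}\n")
```

Calling `write_fasta(concatenate_by_strain("genomes"), "combined.fasta")` would give one record per strain, headed by the folder name, with Protein1 followed directly by Protein2. Strains where one of the two files is missing are skipped rather than written out half-complete.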