How to download multiple fasta sequences and keep them organized by what genome they came from
7 months ago
ATS • 0

Hi all,

I am trying to do protein co-evolution analysis for two proteins across many related bacterial species/strains. For this analysis, I'd like to download the sequences for protein 1 and protein 2 and then concatenate them together for each bacterial species and strain, like this:

Species1_Str1  Protein1Protein2
Species1_Str2  Protein1Protein2
Species2_Str1  Protein1Protein2
..

So far I've been using protein BLAST to download fasta files for all the homologs of my protein of interest. However, I'm having trouble pairing protein1 and protein2 sequences from the same bacterial strains. It seems like the fasta output stores the protein accession number, species name, and taxID, but not the specific strain information for each.

Is there a way to download two proteins from one genome, concatenate them, and then move on to the next genome? Or any other way to keep sequences from the same genome together after downloading them?

Thanks in advance for any advice! I am very new to bioinformatics.

Blast Sequences • 572 views
7 months ago
inedraylig ▴ 70

Surely you are not downloading FASTA files with BLAST itself: BLAST is a local alignment search tool, not a tool for downloading sequences. You are probably using BLAST to find the homologous sequences and then downloading them another way (through the web interface?). When downloading a group of sequences, one would usually use SRA or another way of accessing NCBI services (this is explained on their website, for a start), create a list of accessions, and then download the sequences with a script.

For a well-organized project, may I suggest you prepare separate directories and work through the search in steps:

  1. Use BLAST to locate the proteins you want to analyze
  2. Save the information that is required for SRA access, like GenBank information or UniProtKB/Swiss-Prot information, accession and project name.
  3. Create folders for each bacterial species and strain
  4. Use the SRA/GenBank information to download all your sequences, in an organized way, to each folder.
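
Steps 2 to 4 above could be sketched in Python roughly like this. Treat it as a minimal sketch under assumptions, not a tested pipeline: the manifest layout and the helper names (`folder_for`, `efetch_url`, `download_all`) are hypothetical, and it assumes the protein records are fetched through NCBI's E-utilities `efetch` endpoint rather than through SRA, since SRA stores sequencing reads rather than protein records. The pause between requests is there because NCBI asks for no more than about three requests per second without an API key.

```python
import re
import time
import urllib.parse
import urllib.request
from pathlib import Path

# NCBI E-utilities endpoint for fetching records by accession.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def folder_for(species: str, strain: str) -> str:
    """Build a filesystem-safe folder name such as 'Species1_Str1'."""
    raw = f"{species}_{strain}"
    return re.sub(r"[^A-Za-z0-9._-]+", "_", raw).strip("_")


def efetch_url(accession: str) -> str:
    """Return the efetch URL for one protein accession in FASTA format."""
    query = urllib.parse.urlencode(
        {"db": "protein", "id": accession, "rettype": "fasta", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def download_all(manifest, out_dir="genomes", pause=0.4):
    """Download each protein into a folder named after its species and strain.

    manifest: iterable of (species, strain, label, accession) tuples, where
    label is a hypothetical short name such as 'protein1' or 'protein2'.
    """
    for species, strain, label, accession in manifest:
        target = Path(out_dir) / folder_for(species, strain)
        target.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(efetch_url(accession)) as response:
            (target / f"{label}.fasta").write_bytes(response.read())
        time.sleep(pause)  # stay under NCBI's request rate limit
```

With a layout like this, each folder ends up holding `protein1.fasta` and `protein2.fasta` for exactly one strain, so the later concatenation step never has to guess which sequences belong together. The manifest itself still has to be built by hand or from the BLAST results, which is where the strain information needs to be recorded.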

Honestly, I was using the protein BLAST web interface to find homologous sequences, then using the multi-sequence download tool to save the homolog sequences of interest. This outputs a single FASTA file with all selected protein sequences, their species, and their accession numbers.

I was originally planning to write a script that looked at the files for both proteins and concatenated sequences that had the same species and strain info in the FASTA header. However, the resulting FASTA does not report strain info, only species.
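
For reference, the kind of script described above might look something like the following. It is only a sketch under assumptions: it assumes each FASTA header ends with the organism name in square brackets (as NCBI protein headers usually do), and the function names are made up for illustration. Because that bracketed name often carries only the species, any organism that appears more than once in either file is reported as ambiguous instead of being paired by guesswork.

```python
import re
from collections import defaultdict


def read_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def organism_key(header):
    """Return the trailing [Organism] tag of a header, or None if absent."""
    match = re.search(r"\[([^\[\]]+)\]\s*$", header)
    return match.group(1) if match else None


def index_by_key(records):
    """Group sequences by their organism key."""
    grouped = defaultdict(list)
    for header, seq in records:
        grouped[organism_key(header)].append(seq)
    return grouped


def concatenate_pairs(fasta1, fasta2):
    """Return ({organism: seq1 + seq2}, list of ambiguous organisms)."""
    g1 = index_by_key(read_fasta(fasta1))
    g2 = index_by_key(read_fasta(fasta2))
    paired, ambiguous = {}, []
    for key in sorted(k for k in g1 if k is not None and k in g2):
        if len(g1[key]) == 1 and len(g2[key]) == 1:
            paired[key] = g1[key][0] + g2[key][0]
        else:
            ambiguous.append(key)  # several candidates: cannot pair safely
    return paired, ambiguous
```

Calling `concatenate_pairs(open("protein1.fasta").read(), open("protein2.fasta").read())` (file names hypothetical) returns the safely paired sequences plus the organisms that need a finer key. For those, `organism_key` would have to be swapped for something that identifies a single genome, which the downloaded headers alone do not provide.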

Thank you for sharing resources on a different approach to take! I will dig into the NCBI website explanations.
