Hi all,
I am trying to run a protein co-evolution analysis on two proteins across many related bacterial species and strains. For this analysis, I'd like to download the sequences of protein 1 and protein 2 and concatenate them for each species and strain, like this:
Species1_Str1 Protein1Protein2
Species1_Str2 Protein1Protein2
Species2_Str1 Protein1Protein2
...
So far I've been using protein BLAST to download FASTA files for all the homologs of my proteins of interest. However, I'm having trouble pairing the protein 1 and protein 2 sequences that come from the same bacterial strain. It seems like the FASTA output stores the protein accession number, species name, and taxID for each sequence, but not the specific strain.
Is there a way to download two proteins from one genome, concatenate them, and then move on to the next genome? Or any other way to keep sequences from the same genome together after downloading them?
Thanks in advance for any advice! I am very new to bioinformatics.
Honestly, I was using the protein BLAST web interface to find homologous sequences, then using the multi-sequence download tool to save the homologs of interest. This outputs a single FASTA file with all selected protein sequences, their species, and their accession numbers.
I was originally planning to write a script that read the FASTA files for both proteins and concatenated the sequences whose headers carried the same species and strain info. However, the downloaded FASTA headers report only the species, not the strain.
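For reference, a minimal sketch of the script I had in mind, in Python. It assumes each FASTA header ends with a bracketed label such as `[Species1 Str1]` that is unique per genome, which is exactly what my BLAST downloads lack, so it only becomes useful once the headers carry a strain name or some other per-genome identifier. The function names (`read_fasta`, `genome_key`, `concatenate_pairs`) are my own placeholders, not from any library.

```python
import re
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def genome_key(header):
    """Return the text inside the last [...] of a header, or None.

    Assumption: this bracketed label identifies one genome uniquely.
    """
    matches = re.findall(r"\[([^\[\]]+)\]", header)
    return matches[-1].strip() if matches else None


def index_by_key(records):
    """Group sequences by their genome key."""
    by_key = defaultdict(list)
    for header, seq in records:
        key = genome_key(header)
        if key:
            by_key[key].append(seq)
    return by_key


def concatenate_pairs(fasta1_lines, fasta2_lines):
    """Return {key: protein1 + protein2} for keys found once in each file."""
    idx1 = index_by_key(read_fasta(fasta1_lines))
    idx2 = index_by_key(read_fasta(fasta2_lines))
    paired = {}
    for key in sorted(idx1.keys() & idx2.keys()):
        # Skip ambiguous keys: several candidates cannot be paired safely.
        if len(idx1[key]) == 1 and len(idx2[key]) == 1:
            paired[key] = idx1[key][0] + idx2[key][0]
    return paired


if __name__ == "__main__":
    # Made-up toy records, only to show the input and output shapes.
    protein1 = [">A1 protein one [Species1 Str1]", "MKT", "AAL",
                ">A2 protein one [Species1 Str2]", "MKS"]
    protein2 = [">B1 protein two [Species1 Str1]", "GGH",
                ">B2 protein two [Species1 Str2]", "GGR"]
    for key, seq in concatenate_pairs(protein1, protein2).items():
        print(f">{key.replace(' ', '_')}\n{seq}")
```

With species-only headers, every strain of a species collapses onto the same key, and the sketch drops those keys as ambiguous rather than guessing which sequences belong together.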
Thank you for sharing resources on a different approach to take! I will dig into the NCBI website explanations.