I am trying to align corresponding fasta records from separate multi-fasta files which all contain the same number of fasta records. Each multi-fasta file contains ordered orthologous nt sequences. The format is as follows:
Multi-fasta for strain 1:
>Strain1_ortholog1
ATGC
>Strain1_ortholog2
GACT
Multi-fasta for strain 2:
>Strain2_ortholog1
ATGC
>Strain2_ortholog2
GATT
I have 21 strains, where each strain's multi-fasta file contains 592 ordered orthologs, and I would like my output to be strain-specific aligned multi-fasta files (i.e. the fasta sequences should contain gaps where appropriate). I am wondering if there is a good script/tool I can use to accomplish this. Thanks for any input you can provide!
If you have multi-fasta files that are strain specific, they can go directly into an MSA program.
If you want all ortholog_1 records to go in one file, then a putative workflow is: split each strain's file into one file per ortholog (faSplit from Jim Kent is one option), cat all the ortholog_1 files together, and then run an MSA on each of these combined files.
"If you have multi-fasta files that are strain specific they can directly go into a MSA program."
Wouldn't most MSA programs align all the records in a single file, rather than aligning corresponding sequences between files?
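For reference, the split-and-concatenate workflow suggested above can be sketched in Python, so that each output file holds only corresponding sequences. This is a minimal sketch, not a tested pipeline: the function names `read_fasta` and `regroup_by_position` and the output naming `ortholog_<i>.fasta` are hypothetical, and it assumes the records in every strain file are in the same ortholog order, as described in the question.

```python
from pathlib import Path


def read_fasta(path):
    """Return a list of (header, sequence) tuples in file order."""
    records = []
    header, chunks = None, []
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def regroup_by_position(strain_files, out_dir):
    """Write one FASTA per ortholog holding the i-th record of every strain.

    Assumption: all strain files list their orthologs in the same order.
    """
    per_strain = [read_fasta(f) for f in strain_files]
    counts = {len(records) for records in per_strain}
    if len(counts) != 1:
        raise ValueError("strain files differ in record count")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # zip(*per_strain) yields the first record of each strain, then the second, ...
    for i, group in enumerate(zip(*per_strain), start=1):
        out_path = out_dir / f"ortholog_{i}.fasta"
        out_path.write_text("".join(f">{h}\n{s}\n" for h, s in group))
        written.append(out_path)
    return written
```

With 21 strain files of 592 records each, this would produce 592 files of 21 sequences, each of which can then be passed to an MSA program on its own.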
It was not clear from your original question, since it contained the part I quoted in my last comment. You already appear to have strain specific files, based on the example posted.
If you mean that all ortholog_1 records from different strains need to go in one alignment, then follow the second workflow I proposed above. This assumes that Strain1_ortholog_1 directly corresponds to Strain2_ortholog_1, and so on.
Assuming that the OP's requirement is an ortholog-specific MSA:
1. List the headers from a single file (assuming that all files have the same number of records and share the same ortholog names) and extract the ortholog part.
2. Using this list, query all the fasta files, serially or in parallel, for each ortholog (from step 1), and write the matching sequences from every strain to an individual fasta file per ortholog.
3. Subject each ortholog file to MSA, serially or in parallel, with the same parameters.
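The steps above could look roughly like the following in Python. Treat it as a sketch under stated assumptions: headers follow the `StrainN_orthologM` pattern from the question, so the ortholog part is everything after the first underscore; the helper names (`ortholog_key`, `group_by_ortholog`, `write_groups`, `align_file`) are made up for illustration; and MAFFT is used only as an example aligner that is assumed to be installed and on the PATH.

```python
import subprocess
from collections import defaultdict
from pathlib import Path


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    for block in text.split(">")[1:]:
        lines = block.splitlines()
        if not lines:
            continue
        records.append((lines[0].strip(), "".join(l.strip() for l in lines[1:])))
    return records


def ortholog_key(header):
    """Step 1: extract the ortholog part, e.g. 'Strain1_ortholog2' -> 'ortholog2'.

    Assumption: the strain name itself contains no underscore.
    """
    return header.split()[0].split("_", 1)[1]


def group_by_ortholog(strain_files):
    """Step 2: collect each ortholog's records from every strain file."""
    groups = defaultdict(list)
    for path in strain_files:
        for header, seq in parse_fasta(Path(path).read_text()):
            groups[ortholog_key(header)].append((header, seq))
    return groups


def write_groups(groups, out_dir):
    """Write one unaligned FASTA per ortholog and return {ortholog: path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for key, records in groups.items():
        paths[key] = out_dir / f"{key}.fasta"
        paths[key].write_text("".join(f">{h}\n{s}\n" for h, s in records))
    return paths


def align_file(in_path, out_path):
    """Step 3: align one ortholog file with the same parameters every time.

    Assumption: MAFFT is installed; it writes the alignment to stdout.
    """
    result = subprocess.run(
        ["mafft", "--auto", str(in_path)],
        capture_output=True, text=True, check=True,
    )
    Path(out_path).write_text(result.stdout)
```

Grouping by header name rather than by position means the files do not need to be in the same order, but it does rely on the ortholog names matching exactly across strains.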
If OP requirement is to have strain specific MSA: