Hello all,
I want to perform multiple alignments of genes to a reference sequence using Muscle (https://www.drive5.com/muscle/) locally. The reference sequence is stored in a fasta file, and every chromosome is stored as different fasta sequence within this fasta file. To give you examples:
**gene.fa**
>gene1
ATGATG....
**reference_genome.fa**
>chr1
ATCGTGA...
>chr2
GTACTAA...
>chr3
ATCTGTC...
When I go through Muscle manual, it mentions the command for performing alignments to be something like:
muscle -in seqs.fa -out seqs.afa
All the input sequences are to be stored within seqs.fa
(if I understand it correctly). My question: in my case how do I create the seqs.fa
file, given that the chromosomes themselves are stored as different fasta sequences?
Any help would be appreciated!
You can simply
cat
the chromosome fasta files to create full reference multi-fasta.Note: Doing this may be perilous. A sequence may align to more than one chromosome so MSA's may not work properly.
I may be misunderstanding what you have written, but to make it clear once again - the chromosome-specific fasta sequences are already stored/contained within a
ref.fa
. My question is - what should be my input fasta file to Muscle? The one which has all the chromosome fasta sequences (separate entry per chromosome) and the fasta sequence of the gene, combined? Would that not mean that Muscle will try to align all chromosomes against each other in addition to aligning the gene with the chromosomes?If you just want align genes to the reference then your input should only contain the gene sequence.
Okay, then how do I specify the target? For example in blastn - I can create a database from the reference genome using:
and then align the gene using:
Does Muscle also support this way of creating a "local" database of the reference genome and then aligning? I have not come across it. If this is becoming too naive then would you mind giving me an example of how you would do it with Muscle (the command that you would use)?
I think you may be mistaken here. Blast shows you a pair-wise alignment of query/subject as a convenient way of visualizing the
hits
. It is identifying the region using similarity searching. Multiple alignment programs on other hand are not doing similarity searches. They need defined/refined inputs e.g. gene(s) or regions that are supposed to have some (evolutionary) relationship. It does not make sense to try to do a multiple sequence alignment of genes (much smaller) against an entire chromosome/genome reference.Identify the genes/regions that you want to do a MSA for from the reference (this can be done using blast). Isolate those sequence regions in a file and then run
muscle
on that selected dataset.