Question

MUSCLE alignment - aligning whole genome with gene sequences

0

Entering edit mode

3.3 years ago

manaswwm ▴ 550

Hello all,

I want to perform multiple alignments of genes to a reference sequence using Muscle (https://www.drive5.com/muscle/) locally. The reference sequence is stored in a fasta file, and every chromosome is stored as different fasta sequence within this fasta file. To give you examples:

**gene.fa**
>gene1
ATGATG....

**reference_genome.fa**
>chr1
ATCGTGA...
>chr2
GTACTAA...
>chr3
ATCTGTC...

When I go through Muscle manual, it mentions the command for performing alignments to be something like:

muscle -in seqs.fa -out seqs.afa

All the input sequences are to be stored within seqs.fa (if I understand it correctly). My question: in my case how do I create the seqs.fa file, given that the chromosomes themselves are stored as different fasta sequences?

Any help would be appreciated!

alignment mega • 1.8k views

ADD COMMENT • link updated 3.3 years ago by GenoMax 147k • written 3.3 years ago by manaswwm ▴ 550

0

Entering edit mode

You can simply cat the chromosome fasta files to create full reference multi-fasta.

cat chr1.fa chr2.fa ... > ref.fa

Note: Doing this may be perilous. A sequence may align to more than one chromosome so MSA's may not work properly.

ADD REPLY • link 3.3 years ago by GenoMax 147k

0

Entering edit mode

I may be misunderstanding what you have written, but to make it clear once again - the chromosome-specific fasta sequences are already stored/contained within a ref.fa. My question is - what should be my input fasta file to Muscle? The one which has all the chromosome fasta sequences (separate entry per chromosome) and the fasta sequence of the gene, combined? Would that not mean that Muscle will try to align all chromosomes against each other in addition to aligning the gene with the chromosomes?

ADD REPLY • link 3.3 years ago by manaswwm ▴ 550

0

Entering edit mode

multiple alignments of genes to a reference sequence

If you just want align genes to the reference then your input should only contain the gene sequence.

ADD REPLY • link 3.3 years ago by GenoMax 147k

0

Entering edit mode

Okay, then how do I specify the target? For example in blastn - I can create a database from the reference genome using:

makeblastdb -in reference.fa -dbtype nucl

and then align the gene using:

blastn -query gene.fa -db reference.fa

Does Muscle also support this way of creating a "local" database of the reference genome and then aligning? I have not come across it. If this is becoming too naive then would you mind giving me an example of how you would do it with Muscle (the command that you would use)?

ADD REPLY • link 3.3 years ago by manaswwm ▴ 550

1

Entering edit mode

I think you may be mistaken here. Blast shows you a pair-wise alignment of query/subject as a convenient way of visualizing the hits. It is identifying the region using similarity searching. Multiple alignment programs on other hand are not doing similarity searches. They need defined/refined inputs e.g. gene(s) or regions that are supposed to have some (evolutionary) relationship. It does not make sense to try to do a multiple sequence alignment of genes (much smaller) against an entire chromosome/genome reference.

Identify the genes/regions that you want to do a MSA for from the reference (this can be done using blast). Isolate those sequence regions in a file and then run muscle on that selected dataset.

ADD REPLY • link 3.3 years ago by GenoMax 147k