Question

Multi-Sequence Alignment For Many Groups Of Genes

0

10.9 years ago

biolab ★ 1.4k

Dear all, how can I make multiple sequence alignments for many groups of genes? Here is an example.

>gene1_human
ATTTGCGTGACTGACTGC
>Gene2_human
GCGCGCATGATCCGATGACTG
>gene3_human
TGATACGATGCTGACTGACTGAC
......

>gene1_fly
ATTTGCGTGACTCTGC
>Gene2_fly
GCGCATGATCCGATGACTG
>gene3_fly
TGATACGATGCTGACTGTGAC
......

>gene1_worm
ATTTGCGTGACTCTGaC
>Gene2_worm
GCGCATGATCCGATGccACTG
>gene3_worm
TgtgGATACGATGCTGACTGTGAC
...

>gene1_mouse
ATTTGCGTGACTCTGaC
>Gene2_mouse
GCGCATGATCCGATGccACTG
>gene3_mouse
TgtgGATACGATGCTGACTGTGAC
......

I need to compare each gene separately across these species. The output should look like the example below: the aligned sequences all have the same length, with gaps marked as -. Does anyone know how to perform this analysis? Please give me some suggestions. Thank you very much!

Human   ATTTGCGTGACTGACTG-C
Mouse   ATTTGCGTGACT--CTGAC
Worm    ATTTGCGTGACT--CTGAC
Fly     ATTTGCGTGACT--CTG-C

alignment • 3.0k views

ADD COMMENT • link updated 10.9 years ago by Pavel Senin ★ 1.9k • written 10.9 years ago by biolab ★ 1.4k
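
One way to get a separate alignment per gene is to first split the combined FASTA into one file per gene. The following is only a minimal sketch: it assumes every header follows the `gene_species` pattern used in the question, and it lower-cases the gene part so that `Gene2_human` and `gene2_fly` land in the same group. The helper names (`read_fasta`, `group_by_gene`, `write_groups`) are illustrative and not part of any alignment tool.

```python
from collections import defaultdict
from pathlib import Path


def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_gene(records):
    """Group records by the gene part of a 'gene_species' header.

    Assumption: the text before the first underscore is the gene name.
    """
    groups = defaultdict(list)
    for header, seq in records:
        gene, _, species = header.partition("_")
        groups[gene.lower()].append((species, seq))
    return dict(groups)


def write_groups(groups, out_dir):
    """Write one FASTA file per gene, e.g. out_dir/gene1.fasta."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for gene, members in sorted(groups.items()):
        path = out_dir / f"{gene}.fasta"
        with path.open("w") as handle:
            for species, seq in members:
                handle.write(f">{gene}_{species}\n{seq}\n")
        paths.append(path)
    return paths


# A few records copied from the question.
example = """>gene1_human
ATTTGCGTGACTGACTGC
>Gene2_human
GCGCGCATGATCCGATGACTG
>gene1_fly
ATTTGCGTGACTCTGC
>Gene2_fly
GCGCATGATCCGATGACTG
"""
groups = group_by_gene(read_fasta(example))
```

Calling `write_groups(groups, "genes")` would then produce `genes/gene1.fasta` and `genes/gene2.fasta` (the directory name is arbitrary), and each file can be aligned on its own.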

score 1 · Answer 1 · 2014-01-17

1

10.9 years ago

Pavel Senin ★ 1.9k

Will clustalw work for you?

CLUSTAL 2.1 multiple sequence alignment


gene1_worm       ----ATTTGCGTGACT--CTGAC----
gene1_mouse      ----ATTTGCGTGACT--CTGAC----
gene1_fly        ----ATTTGCGTGACT--CTGC-----
gene1_human      ----ATTTGCGTGACTGACTGC-----
gene3_fly        ----TGATACGATGCTGACTG--TGAC
gene3_worm       -TGTGGATACGATGCTGACTG--TGAC
gene3_mouse      -TGTGGATACGATGCTGACTG--TGAC
gene3_human      ----TGATACGATGCTGACTGACTGAC
Gene2_human      GCGCGCATGATCCGATGACTG------
Gene2_fly        --GCGCATGATCCGATGACTG------
Gene2_worm       --GCGCATGATCCGATGCCACTG----
Gene2_mouse      --GCGCATGATCCGATGCCACTG----
                       :*..   ..*  *:

Edit: yes, MUSCLE is another option, especially if the sequences vary in length.

Gene2_human      ---GCGCGCA----TGATCCGATG--ACTG
Gene2_fly        -----GCGCA----TGATCCGATG--ACTG
Gene2_worm       -----GCGCA----TGATCCGATGCCACTG
Gene2_mouse      -----GCGCA----TGATCCGATGCCACTG
gene3_human      ---TGATACGATGCTGACTGACTG--AC--
gene3_worm       TGTGGATACGATGCTGACTG--TG--AC--
gene3_mouse      TGTGGATACGATGCTGACTG--TG--AC--
gene3_fly        ---TGATACGATGCTGACTG--TG--AC--
gene1_human      ---ATTTGCG----TGACTGACTG---C--
gene1_fly        ---ATTTGCG----TGACTC--TG---C--
gene1_worm       ---ATTTGCG----TGACTC--TG--AC--
gene1_mouse      ---ATTTGCG----TGACTC--TG--AC--
                         *     ***     **   *

ADD COMMENT • link 10.9 years ago by Pavel Senin ★ 1.9k

0

Thanks! I am just worried that, because I have many genes, some of them may not align well when all sequences are aligned together. For example, the last C in fly and human gene1 is not aligned very well. Ideally, I would run a multiple sequence alignment for each gene in batch. What are your ideas? I don't know the CLUSTAL algorithm.

ADD REPLY • link 10.9 years ago by biolab ★ 1.4k

0

Clustering, as a process, is somewhat different from multiple alignment. You can pre-process your large dataset (how many sequences?) with cd-hit, which will cluster (i.e. partition) it and can handle large data. Then you can apply clustalw to the clusters to obtain multiple alignments.

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k

0

Hi Pavel, I have ~3000 sequences. What tools are used for the cd-hit command? Could you please say a little bit more about the partitioning, or about which tools can be used to pre-process so many sequences? A further question: can I run CLUSTALX(W) from the command line? That way I could run it 3000 times. Thank you very much!

ADD REPLY • link 10.9 years ago by biolab ★ 1.4k

0

You could try to run clustalw on your set of 3K sequences as it is, I guess. CD-HIT is software that partitions (clusters) a large dataset into groups of similar sequences. You can adjust how it forms those groups through a similarity threshold and other parameters. Assuming that the sequences within a group (a cluster) are similar, clustalw will perform multiple sequence alignment on them significantly faster. Yes, you can install clustalw on your computer, and you don't need to run it 3000 times, just once.

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k
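
To go from CD-HIT clusters to per-cluster alignments, the cluster membership has to be read back from the `.clstr` file that CD-HIT writes next to its output. The sketch below assumes the usual layout of that file (a `>Cluster N` line followed by member lines that contain `>name...`); the sample text is hand-written for illustration and is not real CD-HIT output, and `parse_clstr` is a made-up helper name.

```python
import re
from collections import defaultdict

# Assumed member-line layout: "0\t18nt, >gene1_human... *"
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Map cluster number -> list of sequence names from .clstr text."""
    clusters = defaultdict(list)
    current = None
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        elif current is not None:
            match = MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return dict(clusters)


# Hand-written sample in the assumed format (not real program output).
sample = (
    ">Cluster 0\n"
    "0\t18nt, >gene1_human... *\n"
    "1\t16nt, >gene1_fly... at +/93.75%\n"
    ">Cluster 1\n"
    "0\t21nt, >Gene2_human... *\n"
)
clusters = parse_clstr(sample)
```

Each list of names can then be used to pull the matching sequences into their own FASTA file before alignment. Whether the clusters coincide with the gene groups depends on the similarity threshold chosen.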

0

Hi Pavel, I have one more question. I have just installed ClustalW. However, after typing clustalw, a file input window popped up, then another window. It is a step-by-step mode. How can I run it in one go from the command line? I can prepare 3000 gene sequence files, but I don't know how to run clustalw in batch. When you have free time, could you please write me a command for running clustalw in batch (default parameters are OK)? Thank you in advance!

ADD REPLY • link 10.9 years ago by biolab ★ 1.4k
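
A batch run could look roughly like the sketch below. It assumes one `*.fasta` file per gene in a single directory and a `clustalw` executable on the PATH (some installations name it `clustalw2`, hence the `binary` parameter). Passing `-INFILE=`, `-ALIGN` and `-OUTFILE=` on the command line makes ClustalW run non-interactively with default parameters; the function names and the `dry_run` switch are illustrative.

```python
import subprocess
from pathlib import Path


def clustalw_command(fasta_path, binary="clustalw"):
    """Build a non-interactive ClustalW call for one FASTA file."""
    fasta_path = Path(fasta_path)
    out_path = fasta_path.with_suffix(".aln")
    return [binary, f"-INFILE={fasta_path}", "-ALIGN", f"-OUTFILE={out_path}"]


def align_all(in_dir, binary="clustalw", dry_run=False):
    """Run ClustalW once per *.fasta file in in_dir; return the commands used.

    With dry_run=True the commands are only built, not executed.
    """
    commands = []
    for fasta_path in sorted(Path(in_dir).glob("*.fasta")):
        cmd = clustalw_command(fasta_path, binary)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if ClustalW exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

For example, `align_all("genes")` would write `gene1.aln`, `gene2.aln`, and so on next to the input files, where `genes` is a hypothetical directory name. Running with `dry_run=True` first is a cheap way to inspect the commands before launching 3000 alignments.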

1

IMO MUSCLE is way better than Clustal.

ADD REPLY • link 10.9 years ago by 5heikki 11k

0

Sure! Let's add that example too.

ADD REPLY • link 10.9 years ago by Pavel Senin ★ 1.9k

0

Of course, these algorithms are designed for homologs. Somehow I'm getting the idea that you're trying to align unrelated sequences, which wouldn't make sense in almost any context. Moreover, if these are protein-coding genes, you should be aligning amino acids instead of nucleotides.

ADD REPLY • link 10.9 years ago by 5heikki 11k

0

Hi 5heikki, I need to align nucleotide sequences rather than protein sequences. I am using TargetScan to find miRNA targets in various species. The input file should be a multiple sequence alignment. Thanks!

ADD REPLY • link 10.9 years ago by biolab ★ 1.4k
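
Once each gene is aligned, the alignment still has to be reshaped into whatever the target-prediction script expects. The sketch below assumes a three-column, tab-delimited layout (gene ID, species ID, gapped sequence) and uses NCBI Taxonomy IDs as species IDs; both are assumptions to check against the documentation of the TargetScan version in use. It also assumes that "fly" and "worm" mean D. melanogaster and C. elegans. `to_utr_rows` is a made-up helper name.

```python
# NCBI Taxonomy IDs; "fly" and "worm" are assumed to be
# D. melanogaster and C. elegans.
TAXON_IDS = {"human": "9606", "mouse": "10090", "fly": "7227", "worm": "6239"}


def to_utr_rows(gene, aligned, taxon_ids=TAXON_IDS):
    """Turn {species: gapped_sequence} into tab-delimited rows for one gene."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        # Aligned rows must all have the same length, gaps included.
        raise ValueError(f"{gene}: aligned sequences differ in length")
    return [
        f"{gene}\t{taxon_ids[species]}\t{seq.upper()}"
        for species, seq in aligned.items()
    ]


# Aligned rows copied from the desired output in the question.
gene1 = {
    "human": "ATTTGCGTGACTGACTG-C",
    "mouse": "ATTTGCGTGACT--CTGAC",
    "worm": "ATTTGCGTGACT--CTGAC",
    "fly": "ATTTGCGTGACT--CTG-C",
}
rows = to_utr_rows("gene1", gene1)
```

The length check catches the most common slip, which is feeding unaligned sequences (or rows from two different alignments) into the same gene block.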