10.9 years ago
biolab
Dear all, how can I make multiple sequence alignments for many groups of genes? Here is an example.
>gene1_human
ATTTGCGTGACTGACTGC
>Gene2_human
GCGCGCATGATCCGATGACTG
>gene3_human
TGATACGATGCTGACTGACTGAC
......
>gene1_fly
ATTTGCGTGACTCTGC
>Gene2_fly
GCGCATGATCCGATGACTG
>gene3_fly
TGATACGATGCTGACTGTGAC
......
>gene1_worm
ATTTGCGTGACTCTGaC
>Gene2_worm
GCGCATGATCCGATGccACTG
>gene3_worm
TgtgGATACGATGCTGACTGTGAC
...
>gene1_mouse
ATTTGCGTGACTCTGaC
>Gene2_mouse
GCGCATGATCCGATGccACTG
>gene3_mouse
TgtgGATACGATGCTGACTGTGAC
......
I need to compare each gene separately across these species. The output should look like the example below: all sequences have the same length, with gaps marked as "-". Does anyone know how to perform this analysis? Please give me some suggestions. Thank you very much!
Human ATTTGCGTGACTGACTG-C
Mouse ATTTGCGTGACT--CTGAC
Worm ATTTGCGTGACT--CTGAC
Fly ATTTGCGTGACT--CTG-C
Thanks! I just worry that, because I have many genes, some of them may not be aligned well if all sequences are clustered together. For example, the last C in fly and human gene1 is not aligned very well. Ideally I could run multiple sequence alignment for each gene in batch. What are your ideas? I don't know the CLUSTAL algorithm.
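One way to get per-gene alignments is to first split the combined FASTA into one file per gene, so each gene can be aligned on its own. Here is a minimal Python sketch, assuming headers follow the gene_species pattern from the example above (e.g. gene1_human) and that gene names should be matched case-insensitively (gene2 vs. Gene2). The function names and the .fasta output naming are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_gene(records):
    """Group (header, sequence) pairs by gene name.

    Assumes headers look like '<gene>_<species>'; the gene part is
    lower-cased so 'Gene2_fly' and 'gene2_human' end up together.
    """
    groups = defaultdict(list)
    for header, seq in records:
        gene, _, species = header.rpartition("_")
        if not gene:
            raise ValueError(f"header without '<gene>_<species>' form: {header!r}")
        groups[gene.lower()].append((species, seq))
    return dict(groups)


def write_gene_files(groups, out_dir):
    """Write one FASTA file per gene, with the species as sequence name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for gene, members in sorted(groups.items()):
        path = out / f"{gene}.fasta"
        with path.open("w") as fh:
            for species, seq in members:
                fh.write(f">{species}\n{seq}\n")
        paths.append(path)
    return paths
```

With ~3000 genes this gives ~3000 small FASTA files, each holding the same gene from every species, ready to be aligned one by one.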
Clustering, as a process, is somewhat different from multiple alignment. You can pre-process your large dataset (how many sequences?) with cd-hit, which will cluster, i.e. partition, it (it can handle large data). Then you can apply clustalw to the clusters in order to obtain multiple alignments.
Hi Pavel, I have ~3000 sequences. What is the command for cd-hit? Could you please say a little bit more about the partitioning, or about which tools can be used to pre-process so many sequences? A further question: can I run CLUSTALX(W) from the command line? That way I could run it 3000 times. Thank you very much!
You could try to run clustalw on your set of 3K sequences as it is, I guess. CD-HIT is the software used to partition (cluster) a large dataset into groups of similar sequences. By configuring the similarity threshold and other parameters, you can adjust the way it makes those groups. Assuming that sequences within a group (a cluster) are similar, clustalw will perform multiple sequence alignment for them significantly faster. Yes, you can install clustalw on your computer, and you don't need to run it 3000 times, just once.
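To make the CD-HIT step concrete, here is a rough Python sketch. It assumes the nucleotide variant cd-hit-est is installed and that its cluster report (the .clstr file written next to the output) uses the usual ">Cluster N" blocks with member lines such as "0	18nt, >gene1_human... *". The 0.9 identity threshold and word size 8 are only illustrative values, and the function names are made up.

```python
import re


def cd_hit_est_command(fasta, out_prefix, identity=0.9, word_size=8,
                       exe="cd-hit-est"):
    """Build the argument list for a cd-hit-est run (not executed here).

    -i input FASTA, -o output prefix, -c identity threshold, -n word size.
    The threshold and word size are example values, not recommendations.
    """
    return [exe, "-i", str(fasta), "-o", str(out_prefix),
            "-c", str(identity), "-n", str(word_size)]


# Member lines look like: "0<TAB>18nt, >gene1_human... *"
_MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Map cluster number -> list of sequence names from .clstr text."""
    clusters = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif line and current is not None:
            match = _MEMBER.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters
```

The command list can be passed to subprocess.run, and the parsed clusters tell you which sequences would be aligned together.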
Hi Pavel, I have one more question. I have just installed Clustalw. However, after typing
clustalw
I found that a file input window popped up, and then another window. It's a step-by-step mode. How can I run it just once from the command line? I can prepare 3000 gene sequence files, but I don't know how to run clustalw in batch. When you have free time, could you please write me a command for running clustalw in batch (default parameters are OK)? Thank you in advance!!
IMO muscle is way better than clustal.
sure! let's put that example too.
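A minimal sketch of what such a batch run could look like in Python, assuming one FASTA file per gene in a directory and that the aligner is on your PATH. For clustalw it uses the -INFILE, -ALIGN and -OUTFILE options with default parameters; for muscle it assumes the MUSCLE 3.x syntax (-in / -out), which differs in MUSCLE 5. The directory layout, the .aln suffix and the function names are illustrative.

```python
import shutil
import subprocess
from pathlib import Path


def clustalw_command(fasta, out_path, exe="clustalw"):
    """Non-interactive ClustalW call with default alignment parameters."""
    return [exe, f"-INFILE={fasta}", "-ALIGN", f"-OUTFILE={out_path}"]


def muscle_command(fasta, out_path, exe="muscle"):
    """MUSCLE 3.x call (assumption: MUSCLE 5 uses -align/-output instead)."""
    return [exe, "-in", str(fasta), "-out", str(out_path)]


def align_all(in_dir, out_dir, build_command=clustalw_command, suffix=".aln"):
    """Align every *.fasta file in in_dir, one aligner run per file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    done = []
    for fasta in sorted(Path(in_dir).glob("*.fasta")):
        target = out / (fasta.stem + suffix)
        cmd = build_command(fasta, target)
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(f"{cmd[0]} not found on PATH")
        # check=True stops the batch if the aligner reports an error
        subprocess.run(cmd, check=True)
        done.append(target)
    return done
```

For example, align_all("genes", "alignments") would run clustalw once per gene file, and align_all("genes", "alignments", build_command=muscle_command, suffix=".afa") would do the same with muscle.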
Of course, these algorithms are designed for homologs. Somehow I'm getting the idea that you're trying to align unrelated sequences, which wouldn't make sense in almost any context. More than that, if these are protein-coding genes, you should be aligning amino acids instead of nucleotides.
Hi 5heikki, I need to align nucleotide sequences rather than protein sequences. I am using TargetScan to find miRNA targets in various species. The input file should be a multiple sequence alignment. Thanks!