Aligning more than two sequences is called multiple sequence alignment. Clustering means grouping sequences based on their similarity. Here is a toy example:
A C C T A C _ _
A C T T A C _ _
A _ _ T A C G T
A _ _ A A C G T
Suppose the four sequences align as shown above. The first and second are almost the same sequence, differing by one substitution. The third and fourth are similar in the same way. So the 1st and 2nd can be grouped together, and the 3rd and 4th can be grouped together as well.
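To make the grouping concrete, here is a minimal Python sketch that scores each pair of rows in the toy alignment and merges rows whose identity passes a cutoff. The names `column_identity` and `single_linkage`, the 0.8 cutoff, and the choice to skip columns where both rows have a gap are my own assumptions for illustration, not part of any particular tool.

```python
from itertools import combinations

# The four aligned rows from the toy example ("_" marks a gap).
aligned = {
    "seq1": "ACCTAC__",
    "seq2": "ACTTAC__",
    "seq3": "A__TACGT",
    "seq4": "A__AACGT",
}

def column_identity(a, b, gap="_"):
    """Fraction of identical columns, ignoring columns where both rows are gaps."""
    compared = matches = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue  # nothing to compare in this column
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0

def single_linkage(seqs, threshold):
    """Merge clusters whenever any pair of members reaches the threshold."""
    clusters = [{name} for name in seqs]
    for a, b in combinations(seqs, 2):
        if column_identity(seqs[a], seqs[b]) >= threshold:
            ca = next(c for c in clusters if a in c)
            cb = next(c for c in clusters if b in c)
            if ca is not cb:
                ca |= cb
                clusters = [c for c in clusters if c is not cb]
    return clusters

groups = sorted(sorted(c) for c in single_linkage(aligned, threshold=0.8))
print(groups)
```

With this scoring, seq1/seq2 and seq3/seq4 each share 5 of 6 compared columns, while every cross pair shares at most 4 of 8, so the cutoff separates the two groups.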
After assembling short reads, you get transcripts. One gene can have more than one transcript because of alternative splicing. As mentioned in the previous answer, clustering helps to put all these similar sequences together and to build the set of transcripts for one gene. The assembly process can also create some transcripts that are not real (e.g., sequences that are more than 95% identical to another sequence in the cluster), and clustering helps to identify them.
written 10.0 years ago by Janake, updated 2.8 years ago by Ram
Thank you sbdk82 and Janak
It helps me, but how does clustering help to recognize the fake transcripts that have 95% identity with a real one?
written 10.0 years ago by mina, updated 2.8 years ago by Ram
Perhaps, you can take a look at the following program:
http://weizhongli-lab.org/cd-hit/
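
To show how a clustering step can flag near-duplicate transcripts, here is a rough Python sketch of greedy, longest-first clustering, which is the general idea CD-HIT describes: the longest sequence becomes a cluster representative, and each later sequence joins the first representative it matches above the identity cutoff. This is not CD-HIT's implementation. The `identity` function uses Python's `difflib` as a stand-in for a real alignment, and the sequence names, the sequences themselves, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Approximate identity: matched characters / length of the shorter sequence."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return matched / min(len(a), len(b))

def greedy_cluster(seqs, threshold=0.95):
    """Longest-first greedy clustering; returns {representative: [members]}."""
    clusters = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if identity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)  # near-duplicate of an existing representative
                break
        else:
            clusters[name] = [name]  # no match: start a new cluster
    return clusters

# Hypothetical transcripts: gene_a_2 differs from gene_a_1 by one substitution.
gene_a_1 = "ATGGCCAAGCTTGACCTGAGGTCATTCGGAACGTTAGCAT"
transcripts = {
    "gene_a_1": gene_a_1,
    "gene_a_2": gene_a_1[:19] + "C" + gene_a_1[20:],
    "gene_b": "GGGTTTAAACCCGGGTTTAAACCCGGGTTT",
}

clusters = greedy_cluster(transcripts, threshold=0.95)
for rep, members in clusters.items():
    print(rep, members)
```

Members that land in a cluster behind a representative are the candidates for redundant or artifactual transcripts. Whether to discard them or keep them as isoforms is still a judgment call that depends on the threshold and the biology.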