Question

clustering and assembly


10.0 years ago

mina ▴ 20

Hi everyone

I am confused about the difference between clustering and assembly of transcriptome sequences.

As I understand it, assembly means joining overlapping reads to reconstruct the full or partial sequence of an mRNA. Am I right?

Clustering means grouping homologous genes (the expressed mRNAs in the transcriptome data). Am I right?

Clustering is done after assembly, but why do we cluster transcriptome data at all?

It might seem silly, but what is the difference between clustering and multiple sequence alignment? Both show sequence similarity.

English is not my first language, so please excuse any mistakes.

Thanks in advance.

Regards

clustering Assembly • 4.1k views

updated 2.8 years ago by Ram 44k • written 10.0 years ago by mina ▴ 20

Ram · Answer 1 · 2014-12-16

Aligning more than two sequences is called multiple sequence alignment. Clustering means grouping sequences based on their similarity. Here is a toy example:

A C C T A C _ _
A C T T A C _ _
A _ _ T A C G T
A _ _ A A C G T

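Suppose you align these four sequences as shown. The first and second are almost the same sequence, differing by one substitution. The third and fourth are likewise similar to each other. So the first and second can be grouped together, and the third and fourth can be grouped together. In other words, the alignment shows where the sequences agree, and clustering uses that similarity to form groups.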
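To make the grouping step concrete, here is a minimal Python sketch that takes the four aligned rows above, computes a simple pairwise identity, and groups rows whose identity passes a threshold. The sequence names, the identity definition (gap against base counts as a mismatch), and the 0.8 threshold are all assumptions chosen for illustration, not a standard.

```python
from itertools import combinations

# The four aligned rows from the toy example ("_" marks a gap).
# The names seq1..seq4 are made up for this sketch.
aligned = {
    "seq1": "ACCTAC__",
    "seq2": "ACTTAC__",
    "seq3": "A__TACGT",
    "seq4": "A__AACGT",
}

def identity(a, b):
    """Fraction of alignment columns where both rows carry the same base.

    Columns where both rows have a gap are skipped; a base facing a gap
    counts as a mismatch (an assumption, other definitions exist).
    """
    columns = [(x, y) for x, y in zip(a, b) if not (x == "_" and y == "_")]
    matches = sum(1 for x, y in columns if x == y)
    return matches / len(columns)

def cluster(seqs, threshold):
    """Single-linkage grouping: merge two groups if any pair meets the threshold."""
    groups = {name: {name} for name in seqs}
    for a, b in combinations(seqs, 2):
        if identity(seqs[a], seqs[b]) >= threshold and groups[a] is not groups[b]:
            merged = groups[a] | groups[b]
            for name in merged:
                groups[name] = merged
    unique = {frozenset(g) for g in groups.values()}
    return sorted(sorted(g) for g in unique)

clusters = cluster(aligned, threshold=0.8)
print(clusters)
```

With this threshold, seq1 and seq2 (5 of 6 compared columns identical) end up in one group and seq3 and seq4 in another, while the cross pairs fall well below the cutoff.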

Ram · Answer 2 · 2014-12-16


10.0 years ago

Janake ▴ 170

The reason we do clustering after assembly:

After assembling short reads, you get transcripts. One gene can have more than one transcript because of alternative splicing. As mentioned in the previous answer, clustering helps put these similar sequences together to form the set of transcripts for one gene. Also, the assembly process can create transcripts that are not real (e.g., sequences that are more than 95% identical to another sequence in the cluster), and clustering helps identify them.
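As a rough illustration of how near-identical transcripts can be collapsed, here is a small Python sketch of greedy, longest-first clustering: the longest sequence becomes a cluster representative, and each shorter sequence joins the first representative it matches at 95% identity or more. This is only similar in spirit to dedicated tools. The transcript names and sequences are invented, and the identity measure (built on Python's `difflib`) is a crude stand-in for a real alignment.

```python
from difflib import SequenceMatcher

def approx_identity(short, long_):
    """Crude identity: matched characters divided by the shorter sequence's length.

    Assumption: difflib's matching blocks stand in for a proper alignment.
    """
    matcher = SequenceMatcher(None, short, long_, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(short)

def greedy_cluster(transcripts, threshold=0.95):
    """Longest-first greedy clustering; returns {representative: [members]}."""
    clusters = {}
    by_length = sorted(transcripts, key=lambda n: len(transcripts[n]), reverse=True)
    for name in by_length:
        seq = transcripts[name]
        for rep in clusters:
            if approx_identity(seq, transcripts[rep]) >= threshold:
                clusters[rep].append(name)  # near-duplicate of an existing representative
                break
        else:
            clusters[name] = [name]  # no match: this sequence starts a new cluster
    return clusters

# Hypothetical assembled transcripts (made-up names and sequences).
transcripts = {
    "gene1_full": "ATGGCGTACGTTAGCCTAGGATCCGATTACGGCTAAGCTTCAGTGACCTA",
    # Shorter copy of gene1_full with a single substitution.
    "gene1_variant": "ATGGCGTACGTTAGCATAGGATCCGATTACGGCTAAGCTT",
    "gene2_full": "ATGGAGGAAGAGGCAGAGGAAGGAGCAGAGGAATGA",
}

result = greedy_cluster(transcripts)
for rep, members in result.items():
    print(rep, members)
```

Here gene1_variant should land in the cluster represented by gene1_full, so only the representatives need to be carried forward, while gene2_full forms its own cluster. Whether a collapsed sequence is an assembly artifact or a genuine isoform or allele still needs checking.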

updated 2.8 years ago by Ram 44k • written 10.0 years ago by Janake ▴ 170


Thank you sbdk82 and Janak

That helps, but how does clustering help to recognize the fake transcripts that have 95% identity with a real one?

updated 2.8 years ago by Ram 44k • written 10.0 years ago by mina ▴ 20


Perhaps you can take a look at the following program:

http://weizhongli-lab.org/cd-hit/

10.0 years ago by Janake ▴ 170