Question

How to define a full-length transcript for transcriptome assembly?

0

10.5 years ago

Shaojiang Cai ▴ 100

Hi, I am using an RNA-seq library to test my assembler. What I am wondering is: how can we say that a transcript is there?

In RNA-seq libraries, can reads come from UTR regions?
If so, will the UTRs always be fully covered, or can their coverage be fragmented?
Usually, can I say, "transcript A is expressed because its full coding region is assembled"?

Thanks.

RNA-Seq UTR • 4.3k views

ADD COMMENT • link updated 3.0 years ago by Ram 44k • written 10.5 years ago by Shaojiang Cai ▴ 100

Ram · Accepted Answer · 2014-04-24

2

10.5 years ago

Charles Warden 8.3k

At least in my experience, full coverage of the coding region in a single assembled transcript is probably difficult to achieve. This is part of why I would always prefer a direct alignment over de novo assembly (when a reference is available). When working with assembled transcripts, I would favor using a partial contig as a proxy for expression of the relevant gene (rather than requiring a full coding region to be present in the assembly).

Yes, you will have reads from UTRs. As with the coding regions, my guess is that long, high-quality (and not incorrectly stitched) contigs will not necessarily cover the real UTRs in full as a contiguous extension of the coding-region transcript.
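As a rough illustration of treating a partial contig as evidence for a gene, here is a minimal Python sketch that reports what fraction of a coding region (or UTR) is covered by aligned contigs. Everything in it is an assumption made for the example: the function names `merge_intervals` and `covered_fraction`, the half-open coordinates on a known transcript, and the made-up 2000 bp transcript with its contig hits. It is not taken from any particular assembler or aligner.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def covered_fraction(region, contig_hits):
    """Fraction of `region` (start, end) covered by at least one contig hit."""
    r_start, r_end = region
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    covered = 0
    for start, end in merge_intervals(contig_hits):
        overlap = min(end, r_end) - max(start, r_start)
        if overlap > 0:
            covered += overlap
    return covered / (r_end - r_start)


# Hypothetical 2000 bp transcript: 5' UTR 0-200, CDS 200-1700, 3' UTR 1700-2000.
five_prime_utr, cds, three_prime_utr = (0, 200), (200, 1700), (1700, 2000)
# Hypothetical contig alignments in transcript coordinates.
hits = [(150, 900), (850, 1200)]

cds_fraction = covered_fraction(cds, hits)
utr5_fraction = covered_fraction(five_prime_utr, hits)
utr3_fraction = covered_fraction(three_prime_utr, hits)
print(f"CDS: {cds_fraction:.2f}, 5' UTR: {utr5_fraction:.2f}, 3' UTR: {utr3_fraction:.2f}")
```

In this made-up case the contigs cover about two thirds of the CDS, a quarter of the 5' UTR and none of the 3' UTR, which is the kind of partial result I would still accept as a proxy for expression of the gene.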

If it helps, I've collected a list of pointers for a slightly different assembly question in this blog post.

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 10.5 years ago by Charles Warden 8.3k

0

Hi Charles,

Would you recommend Trinity for 454 sequences then? If yes, how can we define the start and end of the transcript?

Thanks.

ADD REPLY • link updated 3.0 years ago by Ram 44k • written 10.4 years ago by MAPK ★ 2.1k

0

You should ask the developers about 454 sequences. My guess is that you would at least need to change some parameters.

There is also an FAQ page.

I'm not sure I understand your second question. Based upon my experience with Illumina data, one problem I have had with Trinity is that it seemed to inappropriately stitch together unrelated sequences. Also, the RNA-Seq data that I see typically don't have complete or even coverage across known transcripts (when aligned to a reference instead of doing de novo assembly), which is why I think using coverage of a well-defined but partial sequence is better for differential expression purposes.

In general, the depth of coverage of reads aligned to the assembly and the uniformity of coverage across that assembly are quality control metrics for assessing the assembly. Unless the sequencing technology directly produces reads that span the whole transcript and you can be absolutely certain that the RNA didn't get fragmented prior to assembly, I can't think of a specific reason why analysis strategies would be fundamentally different (and, in that scenario, there wouldn't be a need for de novo assembly in the first place).
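To make the depth and uniformity metrics a bit more concrete, here is a small Python sketch that summarizes per-base read depth along one contig. It assumes you already have one depth value per position (for example, the depth column of `samtools depth -a` output for reads aligned back to the assembly). The function name `coverage_summary`, the `min_depth` threshold and the toy depth vectors are all hypothetical, and the coefficient of variation is just one possible way to express uniformity.

```python
from statistics import mean, pstdev


def coverage_summary(depths, min_depth=1):
    """Summarize per-base depth for one contig.

    Returns the mean depth, the breadth (fraction of positions with depth
    >= min_depth) and the coefficient of variation (lower means more even).
    """
    if not depths:
        raise ValueError("depths must not be empty")
    avg = mean(depths)
    breadth = sum(d >= min_depth for d in depths) / len(depths)
    cv = pstdev(depths) / avg if avg > 0 else float("nan")
    return {"mean_depth": avg, "breadth": breadth, "cv": cv}


# Two toy contigs with the same mean depth but very different uniformity.
even = coverage_summary([10] * 8)
uneven = coverage_summary([40, 40, 0, 0, 0, 0, 0, 0])

for name, stats in (("even", even), ("uneven", uneven)):
    print(name, stats)
```

Both toy contigs have the same mean depth, but the second one gets it from a pile of reads at one end, so its breadth is low and its coefficient of variation is high. That pattern is what I would look at before trusting a contig as a full-length transcript.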

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 10.4 years ago by Charles Warden 8.3k