Question

Transcript Assembly In Dense Genome Using Cufflinks

1

Entering edit mode

12.2 years ago

Johan ▴ 890

Being a novice in the field of RNA-seq I've run into some trouble when going attempting to do transcript assembly from paired-end reads in a rather dense genome (a virus). The issue is that I'm only getting one transcript which spans almost the entire genome - a result which I find highly unlikely given previous annotations of the same genome.

In more detail, this is what I have done:

I have aligned the data using tophat, with the following command:

tophat2 --library-type fr-secondstrand -p 8  [my reference] [fastq1] [fastq2]

I then try to assemble the transcripts using cufflinks in this way:

cufflinks -p 8 --library-type fr-secondstrand --max-bundle-frags 2000000 --upper-quartile-norm [the sample bam]

I've been using the "--max-bundle-frags" option since I have a very high coverage in some of the samples. And the "--upper-quartile-norm" in hope that it would make the program more sensitive to low abundance transcripts.

This is where the trouble starts. First of all I get the following warning for the samples with fewer reads:

Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.

However, setting the recommended --frag-len-mean does not work, as this is now a option only allowed for unpaired reads. Hoping that this would work anyway, I still checked the results.

This is where I discover that I only find one transcript for most of the samples. In some cases I get no transcripts at all. My intuition is that this is caused be the genome being very dense, which leads to problems for the program to correctly identify where one transcript end and another one begins. However, this might not be right at all. :)

Another thing that I think might be causing the trouble is the relatively low number or reads in some of the samples. The number of mapped reads ranges from 16819 to 42077004, yielding a coverage between ca. < 60 X to 3561 X.

So, to try to summarize I'm looking for information relating to these points:

What is the minimum recommended coverage needed to be able to resolve transcripts?
Are there any specific things to take into account when assembling transcripts from a dense genome?
How serious is the warning about "Using default Gaussian distribution"? And what problems might this cause downstream?

Of course any other information that might help, other than the points above, would be highly welcome. Sorry about the slightly lengthy post, but I wanted to try to get as much information as possible in here, since I'm not experienced enough in this field to be more precise in my question.

(Cross posted here: http://biosupport.se/questions/154/transcript-assembly-in-dense-genome-using-cufflinks)

cufflinks rna-seq • 5.2k views

ADD COMMENT • link updated 12.2 years ago by Srihari ▴ 30 • written 12.2 years ago by Johan ▴ 890

0

Entering edit mode

Is it an RNA virus? In that case it would not be surprising to get a single contig. Does the virus have a circular genome? Then paired reads might map on each side of the origin and confuse Cufflinks' fragment length estimation.

ADD REPLY • link 12.2 years ago by Mikael Huss 4.8k

0

Entering edit mode

It's a DNA virus with a linear genome.

ADD REPLY • link 12.2 years ago by Johan ▴ 890

score 3 · Answer 1 · 2012-10-22

About transcript assembly from dense genome: do you have gene overlaps? Cufflinks will combine transcripts if overlaps are present.

Another point is if your virus express one big mRNA and then splice in several transcripts (some virus do). Maybe you can plot your genome coverage to see if you are covering 100% of the genome, if yes, cufflinks is working fine and you need other tricks to assemble your transcripts.

score 1 · Answer 2 · 2012-10-23

1

Entering edit mode

12.2 years ago

Srihari ▴ 30

It might be useful to visualize the mappings (the bam files) on a GBrowse or JBrowse instance to verify that cufflinks is right, plus I find it incredibly useful to look at all possible models for a given gene.

ADD COMMENT • link 12.2 years ago by Srihari ▴ 30