Transcript Assembly In Dense Genome Using Cufflinks
2
1
Entering edit mode
12.1 years ago
Johan ▴ 890

Being a novice in the field of RNA-seq I've run into some trouble when going attempting to do transcript assembly from paired-end reads in a rather dense genome (a virus). The issue is that I'm only getting one transcript which spans almost the entire genome - a result which I find highly unlikely given previous annotations of the same genome.

In more detail, this is what I have done:

I have aligned the data using tophat, with the following command:

tophat2 --library-type fr-secondstrand -p 8  [my reference] [fastq1] [fastq2]

I then try to assemble the transcripts using cufflinks in this way:

cufflinks -p 8 --library-type fr-secondstrand --max-bundle-frags 2000000 --upper-quartile-norm [the sample bam]

I've been using the "--max-bundle-frags" option since I have a very high coverage in some of the samples. And the "--upper-quartile-norm" in hope that it would make the program more sensitive to low abundance transcripts.

This is where the trouble starts. First of all I get the following warning for the samples with fewer reads:

Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges.  It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.

However, setting the recommended --frag-len-mean does not work, as this is now a option only allowed for unpaired reads. Hoping that this would work anyway, I still checked the results.

This is where I discover that I only find one transcript for most of the samples. In some cases I get no transcripts at all. My intuition is that this is caused be the genome being very dense, which leads to problems for the program to correctly identify where one transcript end and another one begins. However, this might not be right at all. :)

Another thing that I think might be causing the trouble is the relatively low number or reads in some of the samples. The number of mapped reads ranges from 16819 to 42077004, yielding a coverage between ca. < 60 X to 3561 X.

So, to try to summarize I'm looking for information relating to these points:

  • What is the minimum recommended coverage needed to be able to resolve transcripts?
  • Are there any specific things to take into account when assembling transcripts from a dense genome?
  • How serious is the warning about "Using default Gaussian distribution"? And what problems might this cause downstream?

Of course any other information that might help, other than the points above, would be highly welcome. Sorry about the slightly lengthy post, but I wanted to try to get as much information as possible in here, since I'm not experienced enough in this field to be more precise in my question.

(Cross posted here: http://biosupport.se/questions/154/transcript-assembly-in-dense-genome-using-cufflinks)

cufflinks rna-seq • 5.2k views
ADD COMMENT
0
Entering edit mode

Is it an RNA virus? In that case it would not be surprising to get a single contig. Does the virus have a circular genome? Then paired reads might map on each side of the origin and confuse Cufflinks' fragment length estimation.

ADD REPLY
0
Entering edit mode

It's a DNA virus with a linear genome.

ADD REPLY
3
Entering edit mode
12.1 years ago
JC 13k

About transcript assembly from dense genome: do you have gene overlaps? Cufflinks will combine transcripts if overlaps are present.

Another point is if your virus express one big mRNA and then splice in several transcripts (some virus do). Maybe you can plot your genome coverage to see if you are covering 100% of the genome, if yes, cufflinks is working fine and you need other tricks to assemble your transcripts.

ADD COMMENT
1
Entering edit mode
12.1 years ago
Srihari ▴ 30

It might be useful to visualize the mappings (the bam files) on a GBrowse or JBrowse instance to verify that cufflinks is right, plus I find it incredibly useful to look at all possible models for a given gene.

ADD COMMENT

Login before adding your answer.

Traffic: 1846 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6