Being a novice in the field of RNA-seq I've run into some trouble when going attempting to do transcript assembly from paired-end reads in a rather dense genome (a virus). The issue is that I'm only getting one transcript which spans almost the entire genome - a result which I find highly unlikely given previous annotations of the same genome.
In more detail, this is what I have done:
I have aligned the data using tophat, with the following command:
tophat2 --library-type fr-secondstrand -p 8 [my reference] [fastq1] [fastq2]
I then try to assemble the transcripts using cufflinks in this way:
cufflinks -p 8 --library-type fr-secondstrand --max-bundle-frags 2000000 --upper-quartile-norm [the sample bam]
I've been using the "--max-bundle-frags" option since I have a very high coverage in some of the samples. And the "--upper-quartile-norm" in hope that it would make the program more sensitive to low abundance transcripts.
This is where the trouble starts. First of all I get the following warning for the samples with fewer reads:
Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
However, setting the recommended --frag-len-mean does not work, as this is now a option only allowed for unpaired reads. Hoping that this would work anyway, I still checked the results.
This is where I discover that I only find one transcript for most of the samples. In some cases I get no transcripts at all. My intuition is that this is caused be the genome being very dense, which leads to problems for the program to correctly identify where one transcript end and another one begins. However, this might not be right at all. :)
Another thing that I think might be causing the trouble is the relatively low number or reads in some of the samples. The number of mapped reads ranges from 16819 to 42077004, yielding a coverage between ca. < 60 X to 3561 X.
So, to try to summarize I'm looking for information relating to these points:
- What is the minimum recommended coverage needed to be able to resolve transcripts?
- Are there any specific things to take into account when assembling transcripts from a dense genome?
- How serious is the warning about "Using default Gaussian distribution"? And what problems might this cause downstream?
Of course any other information that might help, other than the points above, would be highly welcome. Sorry about the slightly lengthy post, but I wanted to try to get as much information as possible in here, since I'm not experienced enough in this field to be more precise in my question.
(Cross posted here: http://biosupport.se/questions/154/transcript-assembly-in-dense-genome-using-cufflinks)
Is it an RNA virus? In that case it would not be surprising to get a single contig. Does the virus have a circular genome? Then paired reads might map on each side of the origin and confuse Cufflinks' fragment length estimation.
It's a DNA virus with a linear genome.