I am mapping an RNAseq dataset (Illumina 100 bp PE) to a well-assembled genome, but have only ~10 million reads per sample to cover 217,000 exons. This was by design, as I'm primarily interested in differential expression, but my default mapping stage is TopHat with a full GTF annotation file, and this salami-slices the abundance counts down quite significantly (a large proportion of exons end up with 0 reads).
My first thought was to treat the genes as single units, ignoring splice variation. I considered grepping out the 'gene' features from a GFF file, converting them to a GTF and providing that to TopHat, but it appears that gffread from Cufflinks will only convert exons and CDS features (unless I'm missing a flag). Also, since TopHat is built specifically around splice variants, I'm not sure whether that fudge would break it. Alternatively, I imagine it would be more appropriate to take the exon-mapped reads and combine the read counts per gene. Is there a common tool for this that anyone can point me to?
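To be clear, this is roughly the aggregation step I have in mind, sketched in Python under the assumption that I already had a per-exon count table (the file names and column layout here are made up for illustration, not something I currently have):

```python
import csv
from collections import defaultdict

# Hypothetical input: one row per exon, tab-delimited as "exon_id<TAB>gene_id<TAB>count".
gene_counts = defaultdict(int)
with open("exon_counts.tsv") as handle:
    for exon_id, gene_id, count in csv.reader(handle, delimiter="\t"):
        gene_counts[gene_id] += int(count)

# Write one summed count per gene.
with open("gene_counts.tsv", "w") as out:
    for gene_id, total in sorted(gene_counts.items()):
        out.write(f"{gene_id}\t{total}\n")
```

Is there an established tool that does this properly (e.g. handling reads that span more than one exon), rather than naively summing exon counts like this?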
Otherwise, would a different mapper be better than TopHat (one not focused on splice variants)?
Thanks
How did you produce the exon counts in the first place? I don't quite understand your problem. Are you worried that you're losing many counts to some splicing-specific procedure in TopHat? Actually, if you provide TopHat with a GTF file, what the program does is build the genes' transcripts as continuous sequences and align the reads to those, so in that case you would not be losing reads due to splicing. In general, if you want a gene-level analysis, TopHat + htseq-count + DESeq2 is a very suitable pipeline, even with 10M reads.
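In case it helps you see why no extra "combine exons into genes" step is needed: htseq-count already counts reads per gene by checking which gene's exons each alignment overlaps. Below is a simplified sketch of that idea using the HTSeq Python API (the real tool also walks the CIGAR string and handles read pairs and ambiguity modes properly; "annotation.gtf" and "accepted_hits.bam" are placeholder file names):

```python
import collections
import HTSeq

# Map genomic intervals to gene IDs using the exon lines of the GTF.
exons = HTSeq.GenomicArrayOfSets("auto", stranded=False)
for feature in HTSeq.GFF_Reader("annotation.gtf"):
    if feature.type == "exon":
        exons[feature.iv] += feature.attr["gene_id"]

# Count each aligned read towards a gene if it overlaps exons of exactly one gene.
# (Simplified: each mate is counted on its own, and the whole alignment interval
# is used rather than the individual CIGAR match blocks.)
counts = collections.Counter()
for aln in HTSeq.BAM_Reader("accepted_hits.bam"):
    if not aln.aligned:
        continue
    gene_ids = set()
    for iv, step_genes in exons[aln.iv].steps():
        gene_ids |= step_genes
    if len(gene_ids) == 1:
        counts[gene_ids.pop()] += 1

for gene_id, n in counts.most_common(10):
    print(gene_id, n)
```

So you can keep the normal TopHat exon/splice-aware alignment and still get one count per gene to feed into DESeq2.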