I am mapping an RNAseq dataset (Illumina 100 bp PE) to a well-assembled genome, but have only ~10 million reads per sample to cover 217,000 exons. This was by design, as I'm primarily interested in differential expression, but my default mapping stage is TopHat with a full GTF annotation file, and this salami-slices the abundance counts down quite significantly (a large proportion of exons end up with 0 reads).
My first thought was to treat the genes as single units, ignoring splice variation. I considered grepping out the 'gene' features from a GFF file, converting them to a GTF and providing that to TopHat, but it appears that gffread from Cufflinks will only convert exons and CDS features (unless I'm missing a flag). Also, since TopHat is built specifically around splice variants, I'm not sure whether that fudge would break it. Alternatively, I imagine it would be more appropriate to take the exon-mapped reads and combine the read counts per gene. Is there a common tool for this that anyone can point me to?
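To be clear, this is roughly the aggregation step I have in mind, sketched in Python under the assumption that I already had a per-exon count table (the file names and column layout here are made up for illustration, not something I currently have):

```python
import csv
from collections import defaultdict

# Hypothetical input: one row per exon, tab-delimited as "exon_id<TAB>gene_id<TAB>count".
gene_counts = defaultdict(int)
with open("exon_counts.tsv") as handle:
    for exon_id, gene_id, count in csv.reader(handle, delimiter="\t"):
        gene_counts[gene_id] += int(count)

# Write one summed count per gene.
with open("gene_counts.tsv", "w") as out:
    for gene_id, total in sorted(gene_counts.items()):
        out.write(f"{gene_id}\t{total}\n")
```

Is there an established tool that does this properly (e.g. handling reads that span more than one exon), rather than naively summing exon counts like this?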
Otherwise, would a different mapper be better than TopHat (one not focused on splice variants)?
Thanks
How did you produce the exon counts in the first place? I don't quite understand your problem. Are you worried that you're losing many counts to some splicing-specific procedure in TopHat? Actually, if you provide TopHat with a GTF file, what the program does is build the genes' transcripts as continuous sequences and align the reads to those, so in that case you would not be losing reads due to splicing. In general, if you want a gene-level analysis, TopHat + htseq-count + DESeq2 is a very suitable pipeline, even with 10M reads.
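In case it helps you see why no extra "combine exons into genes" step is needed: htseq-count already counts reads per gene by checking which gene's exons each alignment overlaps. Below is a simplified sketch of that idea using the HTSeq Python API (the real tool also walks the CIGAR string and handles read pairs and ambiguity modes properly; "annotation.gtf" and "accepted_hits.bam" are placeholder file names):

```python
import collections
import HTSeq

# Map genomic intervals to gene IDs using the exon lines of the GTF.
exons = HTSeq.GenomicArrayOfSets("auto", stranded=False)
for feature in HTSeq.GFF_Reader("annotation.gtf"):
    if feature.type == "exon":
        exons[feature.iv] += feature.attr["gene_id"]

# Count each aligned read towards a gene if it overlaps exons of exactly one gene.
# (Simplified: each mate is counted on its own, and the whole alignment interval
# is used rather than the individual CIGAR match blocks.)
counts = collections.Counter()
for aln in HTSeq.BAM_Reader("accepted_hits.bam"):
    if not aln.aligned:
        continue
    gene_ids = set()
    for iv, step_genes in exons[aln.iv].steps():
        gene_ids |= step_genes
    if len(gene_ids) == 1:
        counts[gene_ids.pop()] += 1

for gene_id, n in counts.most_common(10):
    print(gene_id, n)
```

So you can keep the normal TopHat exon/splice-aware alignment and still get one count per gene to feed into DESeq2.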