Question

Cufflinks genome-guided de novo transcript assembly: increase in ambiguous reads

9.1 years ago

sunil.mangalam ▴ 10

Hi forum,

I am doing de novo transcript assembly using Cufflinks. In parallel, I am using featureCounts to count against GENCODE M7. My data is 100 bp paired-end (PE), unstranded mouse RNA-Seq.

Our workflows are: STAR -> featureCounts (against GENCODE M7), or STAR -> Cufflinks de novo transcript assembly -> cuffmerge -> featureCounts (against the de novo GTF file).

Comparing our counts from GENCODE M7 vs. the de novo GTF, I see a decrease (~40%) in Unassigned_NoFeatures reads with the de novo GTF, which is encouraging because Cufflinks is probably detecting new transcripts (or extending the exon boundaries of already known transcripts).

But at the same time, I find a huge increase (~400%) in reads that are Unassigned_Ambiguity. This seems to have something to do with these samples being unstranded, because when I align some 100 bp single-end (SE) stranded data, I get a decrease in both Unassigned_Ambiguity and Unassigned_NoFeatures reads, and an increase in total transcript-assigned counts.
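For anyone wanting to reproduce this kind of comparison: featureCounts writes a tab-delimited `.summary` file next to its output, with one row per status (Assigned, Unassigned_Ambiguity, Unassigned_NoFeatures, ...). Below is a minimal sketch of how the percent changes could be computed from two such files. It assumes each summary holds a single BAM column, and the function names and example file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: percent change of one featureCounts status between two runs.
# Assumes each .summary file has a single sample (two tab-separated columns).

# Print the count (column 2) for one status row of a .summary file.
get_count() {
    awk -F'\t' -v s="$1" '$1 == s { print $2 }' "$2"
}

# pct_change STATUS baseline.summary comparison.summary
# Prints the change relative to the baseline, or NA if the baseline
# count is zero or the status row is missing.
pct_change() {
    local base new
    base=$(get_count "$1" "$2")
    new=$(get_count "$1" "$3")
    awk -v b="$base" -v n="$new" 'BEGIN {
        if (b + 0 == 0) { print "NA"; exit }
        printf "%.1f\n", (n - b) / b * 100
    }'
}

# Example call (file names are hypothetical):
# pct_change Unassigned_Ambiguity gencodeGene.txt.summary denovoGene.txt.summary
```

A positive value means the category grew with the second annotation, a negative value means it shrank.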

The Cufflinks command we used for each sample generally looks like this:

bsub -P ssTissue -n 12 \
    -M 250000 \
    -o /home/jtobias/ss/cuff_noM/logs/CV_CV1_cuff.out \
    -e /home/jtobias/ss/cuff_noM/logs/CV_CV1_cuff.err \
    cufflinks -p 8 \
    --max-bundle-frags 300000 \
    -q \
    --library-type fr-unstranded \
    -o /home/jtobias/ss/cuff_noM/CV_CV1 /home/ss/bam_noM/CV_CV1_noM.bam

And for featureCounts against GENCODE M7 or the de novo GTF:

bsub -q max_mem30 -n 12 \
    featureCounts -T 12 \
    -t exon \
    -g gene_id \
    -a ~/m7/m7gtf.gtf \
    -s 1 \
    -o /home/ss/counts/SEgencodeGene.txt

Pretty much the standard options. It looks like we are losing counts against known exons when we use the de novo GTF. Has anyone experienced this before? Any help is much appreciated!

denovo RNA-Seq assembly • 3.0k views


Ram · Answer 1 · 2015-10-31

Why are you assembling de novo when there's a perfectly good reference genome? If you're looking for novel things, then Cufflinks will do that for you, using the reference genome to inform its decisions (as far as I can work out, at least). If transcripts are not being assembled over well-known exons, then that's down to Cufflinks' methodology for transcript assembly. Would it be worth comparing three Cufflinks runs: de novo (as you've already done), against the reference genome with no novel detection, and against the reference genome with novel detection enabled?
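One possible reading of those three runs, sketched as a dry run that only prints the commands: no annotation for de novo, `-G/--GTF` to quantify annotated transcripts only, and `-g/--GTF-guide` to let the annotation guide assembly while still reporting novel transcripts. The BAM and GTF paths are copied from the question; the output root, the mode names, and the `cufflinks_cmd` helper are hypothetical, and the other options are simply carried over from the original command.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print (do not execute) one cufflinks command per mode.
BAM=/home/ss/bam_noM/CV_CV1_noM.bam   # from the question
GTF=~/m7/m7gtf.gtf                    # GENCODE M7 GTF, from the question
OUT=/home/jtobias/ss/cuff_compare     # hypothetical output root

cufflinks_cmd() {
    local mode="$1" annot=""
    case "$mode" in
        denovo)   annot="" ;;          # no annotation supplied
        ref_only) annot="-G $GTF" ;;   # annotated transcripts only, no novel assembly
        guided)   annot="-g $GTF" ;;   # annotation-guided, novel transcripts allowed
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    echo "cufflinks -p 8 --library-type fr-unstranded ${annot:+$annot }-o $OUT/$mode $BAM"
}

for mode in denovo ref_only guided; do
    cufflinks_cmd "$mode"
done
```

Each printed line could then be wrapped in the same `bsub` submission used in the question, and the three resulting GTFs counted with featureCounts for a side-by-side comparison.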