Question

RNA-seq assembly data not matching reference (cufflinks)


5.5 years ago

dodausp ▴ 190

Hi! I recently posted a question on a related matter, but this is a different issue. I am using cufflinks to quantify my RNA-seq data. To get gene names in the output, I pass the -g argument with the appropriate GTF reference file. In short, the command is:

cufflinks -b hg19.fa -g hg19.refGene.gtf -u [sample.bam]

All the output files are generated (genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf). However, when I examined the genes.fpkm_tracking file, most of the genes were not annotated (they appear as CUFF.[ID number]). I thought it had something to do with the reference GTF file, and in fact the coordinates appear to be off by one base relative to the annotation. As an example, these are the loci of a few genes in genes.fpkm_tracking that cufflinks does not assign to a reference gene:

gene_id locus

CUFF.4031 Chr14:21,819,635-21,852,178

CUFF.2756 Chr11:108,093,558-108,239,829

CUFF.6430 Chr17:41,196,311-41,277,382
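To gauge how many entries are affected, something like the following could be used. It is only a minimal sketch: it assumes the tracking file is tab-separated with a header row containing a gene_id column, and the function name count_unannotated and the demo rows (including GENE_A) are made up for illustration.

```python
import csv
import io


def count_unannotated(handle, id_column="gene_id", prefix="CUFF."):
    """Count rows whose identifier starts with `prefix` in a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    unannotated = 0
    for row in reader:
        total += 1
        if row[id_column].startswith(prefix):
            unannotated += 1
    return unannotated, total


# Tiny made-up table for illustration only; a real run would pass
# open("genes.fpkm_tracking") instead of this in-memory handle.
demo = io.StringIO(
    "gene_id\tlocus\n"
    "CUFF.4031\tChr14:21819635-21852178\n"
    "GENE_A\tChr1:100-200\n"
)
unannotated, total = count_unannotated(demo)
print(f"{unannotated} of {total} genes lack a reference name")
```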

And these are the locations of real genes, according to the hg19.refGene.gtf file:

gene_id locus

SUPT16H Chr14:21,819,636-21,852,178

ATM Chr11:108,093,559-108,239,826

BRCA1 Chr17:41,196,312-41,277,381

Note: Although the gtf file contains various transcripts for the same gene, I am just using one example per gene (showing those starting from the very first base).
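The offsets between the two sets of loci above can be computed with a short script rather than by eye. This is a minimal sketch, not part of cufflinks: the helper names parse_locus and locus_offset are hypothetical, and the loci are typed in by hand from the examples in this post.

```python
import re

# Matches loci written as in this post, e.g. "Chr14:21,819,635-21,852,178"
LOCUS_RE = re.compile(r"^\s*(\S+?):\s*([\d,]+)\s*-\s*([\d,]+)\s*$")


def parse_locus(locus):
    """Return (chrom, start, end) from a 'chrom:start-end' string."""
    match = LOCUS_RE.match(locus)
    if match is None:
        raise ValueError(f"unrecognised locus: {locus!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))


def locus_offset(cuff_locus, ref_locus):
    """Return (start_diff, end_diff) as reference minus cufflinks coordinates."""
    c_chrom, c_start, c_end = parse_locus(cuff_locus)
    r_chrom, r_start, r_end = parse_locus(ref_locus)
    if c_chrom != r_chrom:
        raise ValueError("loci are on different chromosomes")
    return r_start - c_start, r_end - c_end


# (cufflinks locus, reference locus) pairs copied from the examples above
pairs = {
    "SUPT16H": ("Chr14:21,819,635-21,852,178", "Chr14:21,819,636-21,852,178"),
    "ATM": ("Chr11:108,093,558-108,239,829", "Chr11:108,093,559-108,239,826"),
    "BRCA1": ("Chr17:41,196,311-41,277,382", "Chr17:41,196,312-41,277,381"),
}
offsets = {gene: locus_offset(c, r) for gene, (c, r) in pairs.items()}
for gene, (start_diff, end_diff) in offsets.items():
    print(f"{gene}\tstart {start_diff:+d}\tend {end_diff:+d}")
```

For these three examples the start differs by exactly one base in every case, while the end coordinates differ by varying amounts.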

This one-base difference is observed across a few thousand transcripts. So my questions are:

1. What seems to be the problem here?

2. Is there any argument in cufflinks one can use to fix this?

3. Is there any other way to solve it?

As always, any insight is greatly appreciated. (:

Thanks!

RNA-Seq alignment assembly • 841 views
