Hi!
I recently posted a question on this topic, but this one is a different issue. I am using cufflinks to quantify my RNA-seq data. To get the gene names, I enabled the -g argument with the appropriate gtf reference file. In short, the command is the following:
cufflinks -b hg19.fa -g hg19.refGene.gtf -u [sample.bam]
All the output files are generated (genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf). However, when I examined the genes.fpkm_tracking file, most of the targets are not annotated (they are presented as CUFF.[ID number]). I thought it had something to do with the reference gtf file, and in fact the reported coordinates seem to be off by one base relative to the annotation. As an example, these are the locations of a few genes in the genes.fpkm_tracking file that cufflinks did not annotate:
gene_id locus
CUFF.4031 Chr14:21,819,635-21,852,178
CUFF.2756 Chr11:108,093,558-108,239,829
CUFF.6430 Chr17:41,196,311-41,277,382
And these are the locations of the real genes, according to the hg19.refGene.gtf file:
gene_id locus
SUPT16H Chr14:21,819,636-21,852,178
ATM Chr11:108,093,559-108,239,826
BRCA1 Chr17:41,196,312-41,277,381
Note: Although the gtf file contains several transcripts for the same gene, I am showing just one example per gene (the one starting from the very first base).
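To make the offset explicit, the comparison above can be sketched in Python. This is only a rough check on the three examples listed; parse_locus and the dictionary names are hypothetical helpers written for this post, not part of cufflinks, and the pairing of CUFF IDs to genes is my own assumption based on the overlapping coordinates.

```python
def parse_locus(locus):
    """Split a locus such as 'Chr14:21,819,635-21,852,178' into (chrom, start, end)."""
    chrom, span = locus.replace(" ", "").split(":")
    start, end = (int(part.replace(",", "")) for part in span.split("-"))
    return chrom, start, end

# Loci as reported in genes.fpkm_tracking (copied from the post).
cufflinks_loci = {
    "CUFF.4031": "Chr14:21,819,635-21,852,178",
    "CUFF.2756": "Chr11:108,093,558-108,239,829",
    "CUFF.6430": "Chr17:41,196,311-41,277,382",
}

# Loci of the matching genes in hg19.refGene.gtf (copied from the post).
refgene_loci = {
    "SUPT16H": "Chr14:21,819,636-21,852,178",
    "ATM": "Chr11:108,093,559-108,239,826",
    "BRCA1": "Chr17:41,196,312-41,277,381",
}

# Assumed pairing of unannotated cufflinks IDs with the overlapping genes.
pairs = [("CUFF.4031", "SUPT16H"), ("CUFF.2756", "ATM"), ("CUFF.6430", "BRCA1")]

offsets = {}
for cuff_id, gene in pairs:
    c_chrom, c_start, c_end = parse_locus(cufflinks_loci[cuff_id])
    g_chrom, g_start, g_end = parse_locus(refgene_loci[gene])
    assert c_chrom == g_chrom, "paired loci should be on the same chromosome"
    # Offset = cufflinks coordinate minus gtf coordinate.
    offsets[gene] = (c_start - g_start, c_end - g_end)
    print(f"{cuff_id} vs {gene}: start offset {offsets[gene][0]}, end offset {offsets[gene][1]}")
```

For these three examples the start offset is -1 in every case, while the end offsets are 0, +3 and +1, so only the start coordinate shows a consistent one-base shift.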
This one-base difference is observed across a few thousand transcripts. So my questions are:
1. What seems to be the problem here?
2. Is there any argument in cufflinks one can use to fix this?
3. Is there any other way to solve it?
As always, any insight is greatly appreciated. (:
Thanks!