Hi all,
I have aligned my RNAseq data to the reference genome and extracted raw count using featureCounts
tool. I used two different GTF file from NCBI, one having only Refseq id as gene_id and other having gene symbol as gene_id.
Refseq as gene_id
chr11 mm10_refGene exon 100883834 100885169 0.000000 + . gene_id "NM_001164062"; transcript_id "NM_001164062";
Gene symbol as gene_id
chr11 stdin exon 100883834 100885169 . + . gene_id "Stat5a"; transcript_id "NM_001164062"; exon_number "18"; exon_id "NM_001164062.18"; gene_name "Stat5a";
Surprisingly when I looked into the raw count for above feature that I have highlighted is showing different count from featuteCounts
output for these two different GTF file It would be really great if someone could help me understanding these different results or If I am missing something here.
gene_id Count
NM_001164062 0
NM_011488 11
Stat5a 14046
Here is the code that I used to extract the count
featureCounts --verbose -p -B -T 20 -a mm10_refseq.gtf -o Emp Emp_R1.sorted.bam
and IGV browser snapshot of the bam coverage at the highlighted features.
There are probably many differences between your GTF files in addition to the one line you posted. Based on the formatting, they don't look like they are actually from NCBI.
Based on that screenshot, there might be something wrong with your GTF file. The exons should be connected like they are in the RefSeq track.
Actually both the gtf file were downloaded from UCSC table browser, First one is NCBI refseq track from table UCSC Refseq (refgene) downloaded as GTF file and second one downloaded as "all fields from selected table" and further converted to GTF format using
genePredToGtf
I checked several features and they are having same number of count from both the gtf file but only few features I guess is behaving differently.
I am not sure how the table browser works in this particular case, but it's possible that it does not convert between formats properly. The initial table may not have all the necessary fields. I would use proper GTF files.
Also, I would modify the initial question to specify how you generated the GTF files. Although some of the contents may be from NCBI, the files are not.
What I understood that there is some issue with NCBI REFseq GTF file downloaded from UCSC table browser.
The second GTF file having gene symbol as gene_id i generated using following line of code