From a bit of googling around and talking to others I think the problem is in you GTF file. I had a pick at your GTF file. This is the head
5 protein_coding exon 1579 3920 . - . gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1"; seqedit "false";
5 protein_coding CDS 1684 3903 . - 0 gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1"; protein_id "GRMZM2G356204_P01";
5 protein_coding start_codon 3901 3903 . - 0 gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1";
5 protein_coding stop_codon 1681 1683 . - 0 gene_id "GRMZM2G356204"; gene_version "1"; transcript_id "GRMZM2G356204_T01"; transcript_version "1"; exon_number "1";
5 protein_coding exon 22933 23193 . - . gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1"; seqedit "false";
5 protein_coding CDS 22933 23193 . - 0 gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1"; protein_id "GRMZM2G054378_P02";
5 protein_coding start_codon 23191 23193 . - 0 gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "1";
5 protein_coding exon 22723 22798 . - . gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "2"; seqedit "false";
5 protein_coding CDS 22723 22798 . - 0 gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "2"; protein_id "GRMZM2G054378_P02";
5 protein_coding exon 22554 22633 . - . gene_id "GRMZM2G054378"; gene_version "1"; transcript_id "GRMZM2G054378_T02"; transcript_version "1"; exon_number "3"; seqedit "false";
And this is the head from mouse gtf file for example
#!genome-build GRCm38.p3
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.5
#!genebuild-last-updated 2015-04
1 havana gene 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC";
1 havana transcript 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; tag "basic";
1 havana exon 3073253 3074322 . + . gene_id "ENSMUSG00000102693"; gene_version "1"; transcript_id "ENSMUST00000193812"; transcript_version "1"; exon_number "1"; gene_name "4933401J01Rik"; gene_source "havana"; gene_biotype "TEC"; transcript_name "4933401J01Rik-001"; transcript_source "havana"; transcript_biotype "TEC"; exon_id "ENSMUSE00001343744"; exon_version "1"; tag "basic";
1 ensembl gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1 ensembl transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_version "1"; transcript_id "ENSMUST00000082908"; transcript_version "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; transcript_biotype "snRNA"; tag "basic"; transcript_support_level "NA";
Here is link to GTF specification in general. However RNA-SeQC do say that that is the format the tools expects. In your GTF column 2 isn't "annotation source" rather I think it is biotype (at least that's what it looks like to me). Also in mouse GTF if you count how many transcripts_id vs gene_id in the transcript feature are there you will find ~100,00 to ~45,000. If RNA-SeQC using those tag to count transcript/gene then the number will be different. But I think the problem is in the GTF format that you have. I suggest you have a closer look at you GTF file.
It would be great to have a look at the source of RNA-SeQC to see what's going on, however I couldn't find source anywhere. Maybe ask them directly. If you do find out how RNA-SeQC detects transcripts/gene I'd love to know, because I do use this tool fair bit.
p.s I do find RNA-SeQC very finicky..
Cheers,
@ivanvishnu would you be able to edit you question and mark all your code as code. Its a bit hard to reads and follow what's going on. I can't really see it in your code, but I guess you have done since I don't think RNA-SeQC would have run. As check list though, have you:
picard CreateSequenceDictionary​
on your reference fasta filesamtools
perhaps)all must be located in the same directory
Also you can give
CREATE_INDEX=true
at anytime, but obviously you want it during mark dups run instead of running BuildBamIndexAnd finally I think there is some confusion about RNA-SeQC output. I don't think it outputs gene counts. To count how many reads have aligned to gene you need to run
htseq-count
orfeatureCounts
. Either waysamtools flagstat
is good tool to run as it gives the same information about mapped reads and unmapped reads and in general you can fish out a lot of info from bam files using different-f
option in samtools e.gsamtools -c -f 4
will provide all unmapped reads in the bam file@Kirill: Thanks for the suggestion. I'd both the .dict file and .fai file in the same directory. The number I mentioned was not gene count. As summary statistic, RNASeqQC reports the number of transcripts and genes detected. The numbers as I mentioned in the post are very different.