Hi, I am trying to do RNA-Seq differential expression analysis using StringTie followed by either Ballgown or DESeq2. I have followed the StringTie manual and used gencode.gtf as the reference annotation file:
stringtie -p 4 -G ../gencode.gtf -o ./stringtie/sample1.gtf ./sample1.bam
This produces a GTF file for each sample, like so:
1 StringTie transcript 131025 134836 1000 + . gene_id "STRG.1"; transcript_id "STRG.1.1"; reference_id "ENST00000442987.3"; ref_gene_id "ENSG00000233750.3"; ref_gene_name "CICP27"; cov "0.097128"; FPKM "0.038404"; TPM "0.050769";
etc
I then used the merge option:
stringtie --merge -p 4 -G ../../gencode.gtf -o stringtie_merged.gtf ../mergelist.txt
to merge the transcript information from all of the samples into a 'master' GTF file that, in my understanding, represents all of the feature information (an annotation file of the transcriptome for my RNA-Seq data), e.g.
1 HAVANA transcript 131025 134836 . + . gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; gene_name "CICP27"; ref_gene_id "ENSG00000233750.3";
I then ran StringTie again using the merged GTF file to obtain a final GTF file with FPKM and coverage for each sample, e.g.
stringtie -e -B -p 4 -G ./stringtie_merged.gtf -o ballgown/sample1/sample1.gtf ./sample1.bam
This produces a final sample GTF:
1 StringTie transcript 131025 134836 1000 + . gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";
Which is all good. However, in this final sample GTF, all of the HAVANA and ENSEMBL entries have FPKM and coverage equal to 0, e.g.
1 HAVANA exon 129055 129173 . - . gene_id "ENSG00000238009.6"; transcript_id "ENST00000471248.1"; exon_number "3"; ref_gene_name "RP11-34P13.7"; cov "0.0";
1 ENSEMBL transcript 120725 133723 . - . gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";
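To put numbers on this, a minimal sketch like the one below could tally the transcript records in a final sample GTF by their source column (StringTie, HAVANA, ENSEMBL), split into zero and non-zero FPKM. The function name tally_by_source is my own, and the sketch assumes the real file is tab-separated with the attributes in column 9 (the examples in this post have had their tabs flattened to spaces).

```python
import re
from collections import Counter

# Matches key "value"; pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def tally_by_source(gtf_lines):
    """Count transcript records per source (column 2), split by zero vs non-zero FPKM.

    Hypothetical helper for illustration; assumes tab-separated GTF lines.
    """
    counts = Counter()
    for line in gtf_lines:
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only transcript records carry FPKM; exon records are skipped.
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        fpkm = float(attrs.get("FPKM", "0"))
        status = "zero" if fpkm == 0.0 else "nonzero"
        counts[(fields[1], status)] += 1
    return counts


# Small in-memory demo built from the records shown in this post (tabs restored).
demo = [
    '1\tStringTie\ttranscript\t131025\t134836\t1000\t+\t.\t'
    'gene_id "ENSG00000233750.3"; transcript_id "ENST00000442987.3"; '
    'ref_gene_name "CICP27"; cov "0.126465"; FPKM "0.053889"; TPM "0.082197";',
    '1\tENSEMBL\ttranscript\t120725\t133723\t.\t-\t.\t'
    'gene_id "ENSG00000238009.6"; transcript_id "ENST00000610542.1"; '
    'ref_gene_name "RP11-34P13.7"; cov "0.0"; FPKM "0.000000"; TPM "0.000000";',
]
for (source, status), n in sorted(tally_by_source(demo).items()):
    print(source, status, n)
```

On a real file the same function can be fed an open file handle, e.g. tally_by_source(open("ballgown/sample1/sample1.gtf")). If every HAVANA/ENSEMBL transcript lands in the zero bucket and every StringTie one in the non-zero bucket, that would match what I am seeing by eye.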
My question is: why are these HAVANA/ENSEMBL entries there? My transcript at position 131025 in the example above originally came from gencode.gtf but is labelled StringTie in the final GTF, which I don't particularly care about. But why are there lots of HAVANA/ENSEMBL entries with values of zero in the final sample GTF? When I first saw this final sample GTF, my first thought was that it was due to the reference.fasta/BAM files not being numerically ordered, as suggested in this post (see threads #4 and #6): http://seqanswers.com/forums/showthread.php?t=8218
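For reference, the three StringTie stages I ran can be collected into one sketch. It only builds the argument lists and prints them; nothing is executed. The flags are the ones from my commands above, while the function name stringtie_commands, the sample names, and the directory layout are placeholders (my real commands used slightly different relative paths at each step).

```python
import shlex


def stringtie_commands(samples, reference_gtf, merge_list, threads=4):
    """Build the assemble, merge and re-estimate commands as argument lists.

    Hypothetical helper: paths and sample names are placeholders.
    """
    t = str(threads)
    # Stage 1: per-sample assembly guided by the reference annotation.
    assemble = [
        ["stringtie", "-p", t, "-G", reference_gtf,
         "-o", f"stringtie/{s}.gtf", f"{s}.bam"]
        for s in samples
    ]
    # Stage 2: merge the per-sample GTFs listed in merge_list.
    merge = ["stringtie", "--merge", "-p", t, "-G", reference_gtf,
             "-o", "stringtie_merged.gtf", merge_list]
    # Stage 3: re-estimate abundances against the merged GTF (-e),
    # writing Ballgown input tables (-B).
    quantify = [
        ["stringtie", "-e", "-B", "-p", t, "-G", "stringtie_merged.gtf",
         "-o", f"ballgown/{s}/{s}.gtf", f"{s}.bam"]
        for s in samples
    ]
    return assemble, merge, quantify


assemble, merge, quantify = stringtie_commands(
    ["sample1", "sample2"], "gencode.gtf", "mergelist.txt"
)
for cmd in assemble + [merge] + quantify:
    print(shlex.join(cmd))
```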