Hi,
I am trying the hisat2 -> stringtie -> ballgown workflow. I have 46 samples from 2 tissues. I run stringtie 3 times as described in http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html, approximately like this:
stringtie sample1.bam -G genes_ensembl_GRCh37_20150717.gtf -o sample1_run1.gtf #for each sample
stringtie --merge -G genes_ensembl_GRCh37_20150717.gtf -o merged.gtf listofrun1gtfs.txt
stringtie sample1.bam -G merged.gtf -e -B ballgowndirsample1 -o sample1_run3.gtf
However when I, for one gene of particular interest, compared the isoform expressions after the stringtie run to the aligned reads from the bam file in IGV I saw a problem: A commonly called isoform consisting of 27 exons did not have any expression for the last 6 exons, while the first 21 exons had a coverage between 100 and 500. This pattern was clear for all samples of that tissue and different from the pattern of the other tissue.
After some troubleshooting I saw that the shorter potentially novel 21-exon isoform was called in the first run of stringtie, however the merge step removed it. And I think the reason was because it was a substring of the larger 27 exon isoform. Figure 2 in the paper indicate that this might be the intended behavior of the merging. I also tested merging only one gtf file and indeed the shorter isoform disapeared while the 27 exons isoform was still in.
So, have I misunderstood some parameter settings in stringtie? Or is it biologically impossible for one isoform to be a substring of another longer isoform? Or is this a limitation with stringtie?
Vegard
Thanks for a useful suggestion. I have tried without the reference annotation, however the problem is really the same. A novel transcript will be discarded in the merge step if it is a substring of another transcript (novel or from annotation).
I got an answer from the developer here.
It seems to be a limitation with stringtie. The reason for the removal of the shorter transcripts is to avoid counting many weakly (partially captured) expressed parts of one transcript as many different transcripts. So avoiding one type of error may lead to another type of error. The developer will look into it an maybe include an expression criteria in later versions.
Anyhow, this is a high throughput method and a gene can take many different exon-configurations even within one sample. There are bound to be problem situations where the method fails. If I end up with a limited number of interesting genes in my analysis, I will inspect those more closely, for instance with an exon heatmap plot.