Question

Why does stringtie --merge remove potential novel transcript isoforms if they are substrings of longer isoforms?

1

Entering edit mode

8.3 years ago

vegard nygaard ▴ 320

Hi,

I am trying the hisat2 -> stringtie -> ballgown workflow. I have 46 samples from 2 tissues. I run stringtie 3 times as described in http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html, approximately like this:

stringtie sample1.bam -G genes_ensembl_GRCh37_20150717.gtf -o sample1_run1.gtf #for each sample

stringtie --merge -G genes_ensembl_GRCh37_20150717.gtf -o merged.gtf listofrun1gtfs.txt

stringtie sample1.bam -G merged.gtf -e -B ballgowndirsample1 -o sample1_run3.gtf

However when I, for one gene of particular interest, compared the isoform expressions after the stringtie run to the aligned reads from the bam file in IGV I saw a problem: A commonly called isoform consisting of 27 exons did not have any expression for the last 6 exons, while the first 21 exons had a coverage between 100 and 500. This pattern was clear for all samples of that tissue and different from the pattern of the other tissue.

After some troubleshooting I saw that the shorter potentially novel 21-exon isoform was called in the first run of stringtie, however the merge step removed it. And I think the reason was because it was a substring of the larger 27 exon isoform. Figure 2 in the paper indicate that this might be the intended behavior of the merging. I also tested merging only one gtf file and indeed the shorter isoform disapeared while the 27 exons isoform was still in.

So, have I misunderstood some parameter settings in stringtie? Or is it biologically impossible for one isoform to be a substring of another longer isoform? Or is this a limitation with stringtie?

Vegard

RNA-Seq isoforms • 5.5k views

ADD COMMENT • link updated 8.1 years ago by dgrover6 ▴ 10 • written 8.3 years ago by vegard nygaard ▴ 320

score 1 · Answer 1 · 2016-10-03

1

Entering edit mode

8.1 years ago

dgrover6 ▴ 10

According to stringtie manual, in the merge option:

If the -G option (reference annotation) is provided, StringTie will assemble the transfrags from the input GTF files with the reference transcripts.

It seems reference annotation is optional during merge. One way is to try without and see if you can retain novel or not.

ADD COMMENT • link 8.1 years ago by dgrover6 ▴ 10

1

Entering edit mode

Thanks for a useful suggestion. I have tried without the reference annotation, however the problem is really the same. A novel transcript will be discarded in the merge step if it is a substring of another transcript (novel or from annotation).

I got an answer from the developer here.

It seems to be a limitation with stringtie. The reason for the removal of the shorter transcripts is to avoid counting many weakly (partially captured) expressed parts of one transcript as many different transcripts. So avoiding one type of error may lead to another type of error. The developer will look into it an maybe include an expression criteria in later versions.

Anyhow, this is a high throughput method and a gene can take many different exon-configurations even within one sample. There are bound to be problem situations where the method fails. If I end up with a limited number of interesting genes in my analysis, I will inspect those more closely, for instance with an exon heatmap plot.

ADD REPLY • link 8.1 years ago by vegard nygaard ▴ 320