I am using StringTie2 to obtain a better annotation of the chicken genome (galGal6) from long-read data (Oxford Nanopore, ONT). I visualized some genes of interest in IGV, but I found some conflicts that I am struggling to interpret.
In the first example, we have the following tracks (from top to bottom):
- 1- the overall coverage of the ONT reads ;
- 2- the ONT reads ;
- 3- Gene (NCBI reference): the reference (original) GTF file
- 4- GTF obtained with stringtie --merge (NCBI ref merged with the stringtie results from 5)
Result of this command :
stringtie --merge -p 20 -G $GFF -o merged_stringtie.out.gtf lr_guided_allbam.out.gtf
- 5- GTF obtained with stringtie (ONT data + ref NCBI)
Result of this command :
stringtie -L -p 20 -G $GFF -o lr_guided_allbam.out.gtf $INPUT
- 6- GTF obtained with stringtie (ONT data only)
Result of this command :
stringtie -L -p 20 -o lr_allbam.out.gtf $INPUT
The first question is: how come there is such a conflict between tracks 4 and 5? I could not find anything in the StringTie documentation that would explain this situation. It looks like StringTie gives priority to the original reference when there is a conflict. Is that right? Is there a way to modify or quantify these situations with StringTie or any other tool?
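To put a number on such conflicts, one rough option (a sketch, not a StringTie feature) is to compare the intron chains of the multi-exon transcripts in two GTF files, for example tracks 4 and 5. The function names `intron_chains` and `compare_chains` are hypothetical, and the sketch assumes standard GTF exon lines with a quoted `transcript_id` attribute.

```python
from collections import defaultdict


def intron_chains(gtf_lines):
    """Return the set of (chrom, strand, introns) for multi-exon transcripts.

    Assumes standard 9-column GTF lines; only 'exon' features are used.
    """
    exons = defaultdict(list)
    where = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        tid = None
        for attr in fields[8].split(";"):
            attr = attr.strip()
            if attr.startswith("transcript_id"):
                tid = attr.split(None, 1)[1].strip('"')
        if tid is None:
            continue
        exons[tid].append((int(fields[3]), int(fields[4])))
        where[tid] = (fields[0], fields[6])

    chains = set()
    for tid, ex in exons.items():
        ex.sort()
        # An intron spans the gap between two consecutive exons.
        introns = tuple(
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ex, ex[1:])
        )
        if introns:  # single-exon transcripts have no chain to compare
            chains.add(where[tid] + (introns,))
    return chains


def compare_chains(chains_a, chains_b):
    """Count intron chains shared between, or unique to, two annotations."""
    return {
        "shared": len(chains_a & chains_b),
        "only_a": len(chains_a - chains_b),
        "only_b": len(chains_b - chains_a),
    }


# Hypothetical usage with the files produced by the commands above:
# with open("lr_guided_allbam.out.gtf") as a, open("merged_stringtie.out.gtf") as b:
#     print(compare_chains(intron_chains(a), intron_chains(b)))
```

Transcripts counted under `only_a` would be those whose exact intron chain did not survive the merge; transcript ends are deliberately ignored here, so 3' or 5' extensions alone do not count as conflicts.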
In this second example, the tracks are ordered the same way as above (with the same stringtie command for each track). Here, the result of the merging (track #4) is very satisfying: the gene has been extended towards the 3' end. However, when you compare tracks #5 and #6, why are the transcripts detected in the two cases so different? In track #5, we would have expected stringtie to add an annotation in the 5' part of the gene, because some signal is detected there, as shown in track #6.
Thanks a lot for your help in understanding these results.
Thanks for your reply. That's what I am investigating now. I wondered if this was an expected behavior of Stringtie, but it seems that it's not so obvious.
I would label this as a 'quirk' of your dataset and of how StringTie functions. These things are expected in bioinformatics - no single program or algorithm can account for the intricacies of every dataset. Looking at your screenshot, the coverage over that region is not high, so that may be the key factor in this case. Are those reads primary or secondary alignments, and what is their MAPQ? If you hover the mouse cursor over them in IGV, you'll see more info.
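If hovering over reads one by one gets tedious, the same check can be done in bulk. Below is a minimal sketch, assuming the reads for the region have been exported as SAM text (for example with `samtools view` on the BAM file). The function name `summarize_alignments` is hypothetical; the flag bits are the ones defined in the SAM specification.

```python
from collections import Counter

# Flag bits from the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def summarize_alignments(sam_lines):
    """Count alignment types and collect MAPQ values from SAM text lines."""
    kinds = Counter()
    mapqs = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2: bitwise FLAG
        mapq = int(fields[4])  # column 5: mapping quality
        if flag & UNMAPPED:
            kind = "unmapped"
        elif flag & SECONDARY:
            kind = "secondary"
        elif flag & SUPPLEMENTARY:
            kind = "supplementary"
        else:
            kind = "primary"
        kinds[kind] += 1
        if kind != "unmapped":
            mapqs.append(mapq)
    return kinds, mapqs


# Hypothetical usage, reading SAM text piped in on standard input:
# import sys
# kinds, mapqs = summarize_alignments(sys.stdin)
# print(dict(kinds), sorted(mapqs))
```

Many secondary alignments or low MAPQ values over the gene would suggest that the reads are ambiguously placed, which could be one reason the assembly there is unstable.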