I'm using StringTie with Ensembl annotations (GTF file downloaded from Ensembl FTP --> Gene sets --> GTF) and I'm having an issue with exon variants that have slightly different genomic positions. Some exons have start positions that differ by as little as 1 bp (e.g. one starts at 1001, another starts at 1002), and the same goes for the stop positions. As a result, StringTie gives me two different coverage values, one for each of the exons. I would like to treat two such exons as one and the same, and I'm wondering how to go about it.
I can't find a suitable option in the StringTie manual, so I'm considering altering the annotation: finding exons with very small differences in start or stop positions, keeping only the variant with the lowest start position and highest stop position, and re-running StringTie with the new annotation. Is there something obviously flawed with this approach?
Does anyone know of a way to either:
- Make StringTie treat almost-identical exons as one and the same exon, or
- Change the annotation to only contain the longest variant of each exon?
Thanks!
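To make the annotation-editing idea concrete, here is a rough sketch in Python of what I have in mind. It is not a tested pipeline: the function names `collapse_exons` and `rewrite_gtf_lines` and the tolerance `tol=3` are my own placeholders, and it only touches `exon` lines, so transcript, gene and CDS coordinates are left as they are and would need checking separately. Exons on the same chromosome and strand whose start and end both lie within `tol` bases of a cluster's first exon are given the lowest start and highest end of that cluster.

```python
from collections import defaultdict


def collapse_exons(exons, tol=3):
    """Map each (chrom, strand, start, end) exon to a merged (start, end).

    Exons join a cluster when both their start and end are within `tol`
    bases of the first exon placed in that cluster (the anchor).
    """
    by_locus = defaultdict(set)
    for chrom, strand, start, end in exons:
        by_locus[(chrom, strand)].add((start, end))

    mapping = {}
    for (chrom, strand), coords in by_locus.items():
        clusters = []  # each entry: [anchor_start, anchor_end, members]
        for start, end in sorted(coords):
            home = None
            # Anchors are stored in order of increasing start, so scan from
            # the most recent one and stop once anchors are too far upstream.
            for cluster in reversed(clusters):
                if start - cluster[0] > tol:
                    break
                if abs(end - cluster[1]) <= tol:
                    home = cluster
                    break
            if home is None:
                home = [start, end, []]
                clusters.append(home)
            home[2].append((start, end))
        for _, _, members in clusters:
            merged = (min(s for s, _ in members), max(e for _, e in members))
            for s, e in members:
                mapping[(chrom, strand, s, e)] = merged
    return mapping


def rewrite_gtf_lines(lines, tol=3):
    """Return GTF lines with exon coordinates replaced by merged ones.

    GTF columns used: 0 = seqname, 2 = feature, 3 = start, 4 = end, 6 = strand.
    Header/comment lines and non-exon features pass through unchanged.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]

    def is_exon(row):
        return len(row) >= 9 and row[2] == "exon"

    exons = [(r[0], r[6], int(r[3]), int(r[4])) for r in rows if is_exon(r)]
    mapping = collapse_exons(exons, tol)
    out = []
    for r in rows:
        if is_exon(r):
            new_start, new_end = mapping[(r[0], r[6], int(r[3]), int(r[4]))]
            r[3], r[4] = str(new_start), str(new_end)
        out.append("\t".join(r))
    return out
```

The returned lines would be written to a new GTF file and passed to StringTie in place of the original annotation. Comparing against a fixed anchor rather than chaining neighbours is deliberate, so that a run of exons each shifted by 1 bp cannot merge into something far longer than any real variant.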
Thanks, @Macspider, this touches upon my main concern about whether this approach would be "safe"; am I losing something essential? I'm looking at alternative splicing, and if I understand my supervisor correctly, a difference of a few bases is not really of interest, but rather whether the exon is expressed in a sample or not. The problem I'm encountering is along the lines of this example: An exon A appears to be expressed in samples 1-10, but not in sample 11 (according to StringTie's cov value). Looking more closely at the gene in sample 11, I find there's another exon B that is expressed similarly to A in samples 1-10, only that B in sample 11 begins 1 nt before exon A. So the total expression of all the variants of the exon is very similar in all samples; it's just that another, very slightly different exon variant "gets" all the coverage in sample 11, and this is what I'm trying to address. Did this make sense?
EDIT: Also, so far, I've only seen this problem in the first and last exons of a transcript, so I'm wondering if this has to do with transcription start sites or polyadenylation differences (and I'm not currently interested in distinguishing exons at that level of detail).
My guess is that you have leakage of coverage at the transcript margins, and this biases the output of StringTie, which always tries to reconstruct its own reference. Give Cufflinks a try and see whether the results are the same with the same annotation!