Hi,
While going through the KissDE results, I noticed a strange repetition of events that I hadn't seen before the Kiss2refgenome update (v2.0.0).
Here are two different examples:
Two strictly identical IRs for NUBP2. For both, the genomic positions of the splice sites (on the lower path) are 1836656 and 1836758. The variable part length is 101 for both.
The only difference comes from the size of the genomic blocks of the upper path: 177 for one, 178 for the other. So unless I am mistaken, I am looking at the exact same intron retention. But the event has been reported twice by KisSplice, with only a one-base difference in the sequence, not even in the event itself.
EDIT: both sets of sequences (bcc_7866|Cycle_2 and bcc_7866|Cycle_13) have a substitution (C>A) at the very last base, so it is present in both the upper and the lower path, and therefore outside the intron.
Another example, even stranger:
Two IRs for TSPAN32. This time there is absolutely no difference: same block size, same splice sites, same variable part.
In the end, the only difference I can see is a small variation in read coverage: only one read for only one sample in the first example, and one or two reads for several samples in the second example. But it is still the same event...
EDIT: both sets of sequences (bcc_167629|Cycle_2352655 and bcc_167629|Cycle_2352656) have exactly the same lower path, while there is a T>A substitution in the upper path, so directly in the intron. That explains it, I guess.
I might not be clear, so here are the 2 examples with all the data from KissDE: https://docs.google.com/spreadsheets/d/1K9FSZAqcEcu8QLos6yqXAG3BU8LYxX5eI1HWuiDvJBw/edit?usp=sharing
There are several other examples like that, not limited to intron retention, and across several analyses (on completely different samples).
I don't know if the aligner might have something to do with it, but as far as I remember, I used the same version of STAR before and after the Kiss2refgenome update.
EDIT: so the culprit was a one-base variation: outside the intron in the first example, inside the intron in the second. In the end, those events really are duplicates. It is still strange that this type of variation didn't appear before the update. In a list of 2000 differentially expressed events, there are something like 150, maybe 200, of those "duplicated" events. (At a quick glance, it is the same for my other analyses.)
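For anyone who wants to count such duplicates in their own results, here is a minimal sketch of grouping events by gene, event type, splice-site positions and variable part length. The record layout and field names are assumptions made for illustration, not actual KissDE or KisSplice2RefGenome column names, and the third record is an invented placeholder; only the NUBP2 values come from the example above.

```python
from collections import defaultdict

# Hypothetical records: the field names are illustrative only.
# The NUBP2 coordinates and cycle labels are the ones quoted in this post;
# "placeholder_event" is invented to show a non-duplicated event.
events = [
    {"id": "bcc_7866|Cycle_2", "gene": "NUBP2", "type": "IR",
     "splice_sites": (1836656, 1836758), "variable_length": 101},
    {"id": "bcc_7866|Cycle_13", "gene": "NUBP2", "type": "IR",
     "splice_sites": (1836656, 1836758), "variable_length": 101},
    {"id": "placeholder_event", "gene": "GENE_X", "type": "IR",
     "splice_sites": (100, 250), "variable_length": 149},
]


def group_duplicates(events):
    """Return groups of event ids that share the same splicing signature."""
    groups = defaultdict(list)
    for ev in events:
        # Events with the same gene, type, splice sites and variable part
        # length describe the same splicing event, whatever their sequence.
        key = (ev["gene"], ev["type"], ev["splice_sites"], ev["variable_length"])
        groups[key].append(ev["id"])
    # Keep only signatures reported more than once.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


duplicates = group_duplicates(events)
for key, ids in duplicates.items():
    print(key, "reported", len(ids), "times:", ids)
```

The block size of the upper path is deliberately left out of the key, since that is exactly the field that differed by one base in the NUBP2 case.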
EDIT: Maybe I should have mentioned that this analysis was done only with the type_1 file from KisSplice, so it only concerns splicing events.
Thanks for your help!
Hi,
Thanks for the answer! So to be clear, there is no reality behind those events? I ask because I have another case in mind where 2 events seem to be exactly the same, except for the junction. There was a one-base difference exactly at the junction site, shifting it from a canonical to a non-canonical site.
Thanks!
Hi David,
Short answer: these events are real, as they are supported by reads, but most of the time we should merge them together.
The redundancy problem comes from a particular and key structure of the de Bruijn graph: the bubble. KisSplice is optimised to find such structures because each splicing event creates a bubble in the de Bruijn graph. BUT not all bubbles describe a splicing event: SNVs, indels and inexact repeats, among others, also create bubbles in a de Bruijn graph. Now, let's say that we have an intron retention event (1 bubble), but the retained intron exists in two forms: with or without a deletion. This creates a bubble inside the previous bubble. As a result, KisSplice will output ALL POSSIBLE BUBBLES: spliced intron + retained intron without deletion AND spliced intron + retained intron with deletion. And we have a "duplicated" event. The point is, if one is interested in splicing events, this deletion does not carry useful information.
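To make the nested-bubble situation concrete, here is a toy sketch. It is not KisSplice's actual algorithm or data structure: the graph is a hand-written, heavily simplified stand-in for a compacted de Bruijn graph, the node names are invented, and a "bubble" is taken to be a pair of paths between two nodes that share no internal node.

```python
from itertools import combinations

# Toy graph (invented node names). exon_A -> exon_B is the spliced path;
# the path through the intron nodes is the retained intron. The direct edge
# intron_start -> intron_end models the form of the intron with the deletion.
graph = {
    "exon_A": ["exon_B", "intron_start"],
    "intron_start": ["intron_mid", "intron_end"],
    "intron_mid": ["intron_end"],
    "intron_end": ["exon_B"],
    "exon_B": [],
}


def all_paths(graph, source, target, prefix=None):
    """Enumerate every simple path from source to target (depth-first)."""
    prefix = (prefix or []) + [source]
    if source == target:
        return [prefix]
    paths = []
    for nxt in graph[source]:
        if nxt not in prefix:  # never revisit a node on the current path
            paths.extend(all_paths(graph, nxt, target, prefix))
    return paths


def bubbles(graph, source, target):
    """Pairs of source-to-target paths that share no internal node."""
    found = []
    for p, q in combinations(all_paths(graph, source, target), 2):
        if not set(p[1:-1]) & set(q[1:-1]):
            found.append((p, q))
    return found


outer = bubbles(graph, "exon_A", "exon_B")
inner = bubbles(graph, "intron_start", "intron_end")
print(len(outer), "bubbles between the exons,", len(inner), "inside the intron")
```

In this toy graph the spliced path pairs up once with each form of the retained intron, which gives the two outer bubbles that show up as one "duplicated" intron retention, while the deletion itself is the single inner bubble.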
The main issue is that redundant bubbles cause problems during the quantification step, as reads will be multimapped between redundant bubbles (except for the reads that carry or lack the indel in our example, which are the only ones able to distinguish between the two bubbles), and we end up losing statistical power by splitting our reads among redundant bubbles.
We are currently working on integrating redundancy removal into KisSplice, along with other major performance and accuracy improvements to the quantification step. So, in the near future, KisSplice (and not KisSplice2RefGenome) will merge the redundant bubbles.
I hope this was clear enough... Do not hesitate to ask us any questions, we'll be glad to answer :)
Have a nice day!
Hi,
It was very clear. It is good to see those new developments for KisSplice!
Thanks for your help and have a nice day!
Hi to anyone that could be interested!
Due to technical problems, we did not add this feature in KisSplice... yet! But we added it to the latest version of KisSplice2RefGenome, that you can find here: click me
The way this works is not very satisfying: we only keep the first event of any set of duplicated events. We will improve on this duplication problem in new versions of KisSplice!
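As an illustration only, a keep-first strategy like the one described above could be sketched as follows. This is not the KisSplice2RefGenome source code: the helper name, the record fields and the choice of key are all assumptions, since the exact criteria used to decide that two events are duplicates are not spelled out in this thread.

```python
def keep_first(events, key):
    """Keep only the first event seen for each key, preserving input order."""
    seen = set()
    kept = []
    for ev in events:
        k = key(ev)
        if k not in seen:
            seen.add(k)
            kept.append(ev)
    return kept


# Hypothetical records; only the NUBP2 values come from this thread,
# and "placeholder_event" is invented for the example.
events = [
    {"id": "bcc_7866|Cycle_2", "gene": "NUBP2", "splice_sites": (1836656, 1836758)},
    {"id": "bcc_7866|Cycle_13", "gene": "NUBP2", "splice_sites": (1836656, 1836758)},
    {"id": "placeholder_event", "gene": "GENE_X", "splice_sites": (100, 250)},
]

# Assumed duplicate key: same gene and same splice sites.
unique_events = keep_first(events, key=lambda ev: (ev["gene"], ev["splice_sites"]))
print([ev["id"] for ev in unique_events])
```

With this approach, which duplicate survives depends only on the order of the input, and the reads assigned to the discarded events are not merged into the one that is kept.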
Audric
Hi,
Thank you for the update. However, I can't install it.
For KisSplice2RefGenome 2.0.0, I installed it simply with python3 and it worked perfectly :
However, for 2.0.1, I get :
Something about Python3 vs Python2, I guess? (I am more familiar with Perl than Python, to be honest!) As it was not in the setup.py for version 2.0.0, I removed the ", e:". Of course it doesn't work, but it gets further and stalls at:
Trying to install with python2 doesn't work either.
Thanks for your help!
Hello,
Oh yeah, the files were mixed up... This should work now! Thanks for the report :)
Hello,
Thanks, it does work now. I also noticed that in the frameshift column, everything has been swapped: the true values are now false and vice versa. It is far more logical like that!
Thanks again!