Hi,
I am trying to detect chimeric fusions between a virus (integrated into the host genome) and the host genome. I aligned the reads (paired-end, 2x76, stranded) to a hybrid genome (host + virus) in which the virus genome is treated as an additional chromosome. Here's my command:
$STAR --genomeDir $stargenomeDir --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3 --seedSearchStartLmax 10 --outFilterMultimapNmax 10 --outFilterMismatchNmax 10 --chimSegmentMin 10 --outFilterMatchNmin 10 --chimJunctionOverhangMin 10 --readFilesIn $r1 $r2 --runThreadN $threads --outStd SAM --readFilesCommand zcat
Version: STAR_2.3.1u_r375
So I expect STAR to report fusion reads with at least 10 bases aligning to either the host or the virus, and the rest aligning to the virus or the host, respectively. For example:
# : host genome
@ : virus genome
= : read
- : splicing
#######################################@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
==========-----------------------------------=========================
min 10bp
But when I look in IGV at the end of the virus genome, I see some reads with soft-clipping longer than 10 bp (in this case, 13 bp). When I align these soft-clipped bases to the host genome using BLAST, I find a position in the host genome with clear traces of a fusion transcript: reads align past the fusion breakpoint and continue along the transcript. But STAR does not report this fusion. Am I doing something wrong?
Here are the SAM lines for a pair of reads containing the soft-clipping of interest:
NS500186:93:HYFJVBGXX:3:13402:19874:18099 163 chrVirus 1 255 16S59M = 65 128 CACATGTTTAGGTTTGTGACAATGACCATGAGCCCCAAATATCCCCCGGGGGCTTAGAGCCTCCCAGTGAAAAAC AAAAAEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEEEEEE/EEEEEEEEEEEEAEEE<EEEEEEEEEEE NH:i:1 HI:i:1 AS:i:117 nM:i:2
NS500186:93:HYFJVBGXX:3:13402:19874:18099 83 chrVirus 65 255 64M12S = 1 -128 CGCGAAACAGAAGTCTGAAAAGGTCAGGGCCCAGACCAAGGCTCTGACGTCTCCCCCCGGAGGGACAGCTCAGCAC AAEEEEEEEEEEEEEAEEEEEEE/EEEEEEEAEEEAEEEEE/EEEEEEEAEEEEEEEEEEEEEEEEEAEAEAAAAA NH:i:1 HI:i:1 AS:i:117 nM:i:2
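For anyone who wants to pull out soft-clipped ends like these for a BLAST search, here is a minimal Python sketch. It assumes standard tab-delimited SAM records; the function name `soft_clips` and the default `min_len=10` (chosen to mirror `--chimSegmentMin 10` above) are illustrative assumptions, not part of STAR.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clips(sam_line, min_len=10):
    """Return [(side, clipped_bases)] for soft clips of at least min_len bases.

    Assumes a tab-delimited SAM alignment line (not a header line).
    """
    fields = sam_line.rstrip("\n").split("\t")
    cigar, seq = fields[5], fields[9]
    # Hard-clipped bases (H) are not present in SEQ, so drop those operations.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    clips = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_len:
        clips.append(("left", seq[:ops[0][0]]))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_len:
        clips.append(("right", seq[-ops[-1][0]:]))
    return clips

if __name__ == "__main__":
    import sys
    # Usage (hypothetical file name): python soft_clips.py < Aligned.out.sam
    for line in sys.stdin:
        if line.startswith("@"):
            continue  # skip SAM header lines
        name = line.split("\t", 1)[0]
        for side, bases in soft_clips(line):
            print(f">{name}_{side}\n{bases}")
```

The output is FASTA-formatted, so it can be pasted directly into BLAST. Note that the clipped bases are reported as they appear in the SAM record, i.e. relative to the reference strand.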
I have attached two figures illustrating my case:
Alignment on the virus :
Alignment on the host :
Thanks
I think I found the issue. The soft-clipped sequence appears to be present at multiple positions in the host genome. I suppose STAR does not report multi-mapping fusions. I'll dig a little more to be sure.
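A quick way to check whether a short clipped sequence is repeated is to count its exact occurrences on both strands. The sketch below is only an illustration under simplifying assumptions: the genome is already loaded in memory as a dict of name to sequence, only exact matches are counted (BLAST would also find near matches), and the names `reverse_complement` and `find_hits` are made up for this example.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]

def find_hits(query, genome):
    """Return [(chrom, 1-based start, strand)] for exact matches of query.

    genome is assumed to be a dict mapping sequence names to sequences.
    """
    fwd = query.upper()
    rev = reverse_complement(fwd)
    # A palindromic query matches both strands at the same place; search once.
    patterns = [("+", fwd)] if fwd == rev else [("+", fwd), ("-", rev)]
    hits = []
    for strand, pattern in patterns:
        for chrom, seq in genome.items():
            seq = seq.upper()
            start = seq.find(pattern)
            while start != -1:
                hits.append((chrom, start + 1, strand))
                # Step by one so overlapping occurrences are also found.
                start = seq.find(pattern, start + 1)
    return hits
```

More than one hit for the clipped bases means the host side of the junction is ambiguous, which is consistent with the fusion not being reported.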
@NicoBxl: Both of the links you included for the images are not loading (I tested them). Can you post new versions and update your post?
They should work; the links are fine. I also added the SAM lines for a pair of reads.
Perhaps the links are working from your part of the world but they still don't work for me. Referring to this link for example.
I re-uploaded the pictures and updated the post. Can you see them now?
I still can't (even if I use the URL), but don't worry. It may be specific to my local firewall. Are others able to see the images?
I see images. Even when sober.
I see them even before coffee :P
@NicoBxl: You might post this to the STAR email list/google group. Alex is usually pretty good about replying (given how many options STAR has, I expect he's the only one that can point to the right one to tweak).
Yes I will do that. I will post his answer here.
If you would like to try another tool, I wrote RILseq, which was built for chimera detection in bacteria, but I don't see why it wouldn't work here. It copes with multi-mapping reads and runs a statistical analysis to detect over-represented chimeras. See https://pypi.python.org/pypi/RILseq
Did you try STAR-Fusion?