I'm working with human RNA-Seq data and I'm seeing what looks to me like odd behavior at certain splice junctions. For context, the data consists of 50 nt single-end reads, mapped with STAR against the human genome using the GENCODE v34 annotation.
For some junctions, the majority of the split reads have a short overhang on one side, and always on the same side. For example, the mean overhang length is 5 nt on the left side and 45 nt on the right side. Assuming that reads are distributed more or less randomly across a gene, why would the split reads for these junctions all start and split at the same position? Might this be an artifact? With a uniform read distribution, I would expect the mean overhang for a junction to be roughly equal on both sides, i.e. around 25 nt each for 50 nt reads.
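To make that expectation concrete, here is a minimal sketch of the arithmetic. It assumes that every split position within the read is equally likely and that the aligner requires some minimum overhang on each side; `expected_left_overhangs` and its `min_overhang` parameter are hypothetical names for illustration, not STAR options.

```python
from statistics import mean


def expected_left_overhangs(read_length=50, min_overhang=1):
    """List every left overhang a read can have if the split point is
    uniform and at least `min_overhang` bases must align on each side."""
    return list(range(min_overhang, read_length - min_overhang + 1))


read_length = 50
lefts = expected_left_overhangs(read_length, min_overhang=1)
# The right overhang is whatever remains of the read after the split.
rights = [read_length - left for left in lefts]

print(mean(lefts), mean(rights))
```

Under these assumptions both means are 25 nt regardless of the minimum overhang chosen, which is why a 5 vs 45 split looks suspicious to me.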
Has anyone observed similar behavior before and could tell me more about what I'm observing here?
Thanks in advance!
Can you provide an IGV screenshot of this?
The IGV screenshots do not give as clear a picture as the CIGAR strings of the split reads do. Nonetheless, here are three screenshots that will hopefully help a bit. Below each screenshot I have listed the mean lengths of the left and right overhangs, computed from the CIGAR strings.
chr1:9730526-9730621 - Left v right mean overhang: 42.56 v 5.39
chr1:22660893-22660946 - Left v right mean overhang: 43.93 v 4.07
chr1:35192686-35192765 - Left v right mean overhang: 4.40 v 43.35
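For anyone who wants to reproduce numbers like these, the following is a rough sketch of how left and right overhangs can be derived from CIGAR strings. It is not necessarily the exact procedure used above. It assumes the CIGAR strings have already been restricted to reads spanning one junction, and it counts only aligned bases (`M`, `=`, `X`) as overhang, so soft-clipped bases are excluded. The function names `junction_overhangs` and `mean_overhangs` are made up for this example.

```python
import re
from statistics import mean

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations counted as aligned read bases (an assumption of this sketch).
ALIGNED = {"M", "=", "X"}


def junction_overhangs(cigar):
    """Return a (left, right) pair of aligned-base counts for each N
    (skipped region) in the CIGAR string; empty list if unspliced."""
    blocks = [0]  # aligned bases per exon block, split at each N
    for length, op in CIGAR_OP.findall(cigar):
        if op == "N":
            blocks.append(0)
        elif op in ALIGNED:
            blocks[-1] += int(length)
    return [(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]


def mean_overhangs(cigars):
    """Mean left and right overhang over all junction-spanning reads."""
    pairs = [pair for cigar in cigars for pair in junction_overhangs(cigar)]
    if not pairs:
        raise ValueError("no spliced alignments in input")
    return mean(p[0] for p in pairs), mean(p[1] for p in pairs)
```

For a read with CIGAR `5M95N45M`, `junction_overhangs` gives a left overhang of 5 and a right overhang of 45. Comparing these counts with and without soft-clipped bases included may help show whether clipping contributes to the asymmetry.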
Thanks for uploading this. I must say I am stumped. It does look like the overhang alignments stop at a particular base position, as though an N in the reference were causing a soft clip.