I have a question regarding unmapped reads. From the SRBreak paper: "If reads are aligned across breakpoints then some parts of them cannot be mapped the first time. These parts are denoted by the 'S' character in the CIGAR strings of these reads". 'S' indicates soft clipping; the clipped nucleotides are still present in the read. I can find the number of 'S' characters in the CIGAR string. Does anybody know how I can use split reads and align them to a reference genome again?
duplicate of extracting the soft clipped seq only from a sam file
I interpreted this as a slightly different question, since the other question didn't cover the additional steps required to turn the soft clipped reads + alignments into a split read. You need to:
match the fragments back to their reads
drop unmapped fragments - these reads stay as soft clipped reads
rehydrate the sequence and quality scores of the originating read (or write a hard clip)
replace all the SAM flags, fields and tags with those of the original soft clipped read, except the alignment-specific ones such as RNAME, POS, CIGAR, and the NM tag
set supplementary flag
write SA tags
merge the new supplementary reads back into the input file in their mapped position (they were extracted according to the position of the primary soft clipped alignment)
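To make the middle steps concrete, here is a minimal sketch in plain Python of rehydrating one realigned fragment into a supplementary record and writing the SA tags. It is an illustration under stated assumptions, not the code of any particular tool: records are plain dicts with hypothetical keys (`flag`, `rname`, `pos`, `mapq`, `cigar`, `seq`, `qual`, `nm`), the fragment is assumed to have been cut from `primary["seq"][start:end]`, and it is assumed to have mapped. Matching fragments to reads, dropping unmapped fragments, and the final coordinate-sorted merge (for example with `samtools sort`) are left out.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
FLAG_REVERSE, FLAG_SUPPLEMENTARY = 0x10, 0x800


def pad_cigar(cigar, before, after):
    """Extend a fragment CIGAR with soft clips so it spans the whole read."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Fold any soft clips the fragment alignment already has into the padding.
    if ops[0][1] == "S":
        before += ops.pop(0)[0]
    if ops[-1][1] == "S":
        after += ops.pop()[0]
    core = "".join(f"{n}{op}" for n, op in ops)
    return (f"{before}S" if before else "") + core + (f"{after}S" if after else "")


def sa_entry(rec):
    """Format one SA tag entry: rname,pos,strand,CIGAR,mapQ,NM;"""
    strand = "-" if rec["flag"] & FLAG_REVERSE else "+"
    return (f'{rec["rname"]},{rec["pos"]},{strand},'
            f'{rec["cigar"]},{rec["mapq"]},{rec["nm"]};')


def make_supplementary(primary, frag, start, end):
    """Build a supplementary record for a fragment realigned from
    primary["seq"][start:end]. Returns (primary_with_sa, supplementary)."""
    before, after = start, len(primary["seq"]) - end
    seq, qual, flag = primary["seq"], primary["qual"], primary["flag"]
    if frag["flag"] & FLAG_REVERSE:
        # Fragment mapped to the opposite strand: flip the whole read
        # and swap which side the padding goes on.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
        flag ^= FLAG_REVERSE
        before, after = after, before
    sup = dict(primary)  # keep every non-alignment field of the original read
    sup.update(
        flag=flag | FLAG_SUPPLEMENTARY,
        rname=frag["rname"], pos=frag["pos"], mapq=frag["mapq"],
        cigar=pad_cigar(frag["cigar"], before, after),
        nm=frag["nm"], seq=seq, qual=qual,
    )
    sup["sa"] = sa_entry(primary)          # supplementary points at the primary
    primary_out = dict(primary, sa=sa_entry(sup))  # and vice versa
    return primary_out, sup


# Hypothetical example: a 10 bp read aligned as 6M4S whose clipped tail
# realigned to the reverse strand of another contig.
primary = dict(qname="r1", flag=0, rname="chr1", pos=100, mapq=60,
               cigar="6M4S", seq="ACGTACGGTT", qual="IIIIIIIIII", nm=0)
frag = dict(flag=16, rname="chr2", pos=500, mapq=30, cigar="4M", nm=0)
primary_out, sup = make_supplementary(primary, frag, start=6, end=10)
print(sup["flag"], sup["cigar"], sup["seq"], sup["sa"])
```

This sketch keeps the full sequence and soft clips the unaligned part; writing a hard clip instead would mean trimming `seq` and `qual` to the aligned bases and emitting `H` rather than `S`. A read with more than one realigned fragment would need every alignment listed in each record's SA tag, which the sketch does not handle.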