Hello there,
thanks for your time,
I'm working with whole human genome data and focusing mainly on repeat regions. Surprisingly, I see a large number of repeat reads reported as supplementary (chimeric) alignments. This is the case with BWA and Dragen, whereas bowtie2 drops the locations where supplementary alignments would occur. As a result, bowtie2 drops the majority of repeat reads, while BWA and Dragen align the majority of them, but most of those alignments are chimeric.
I want to spend enough time to understand how the aligners behave when dealing with repeat reads; so far I'm only looking at the overall counts from these 3 aligners. To pinpoint alignments at specific repeat locations, I want to choose the best aligner among these 3, or any other aligner that gives good results. But what kind of information will tell me that a particular aligner is doing well when aligning repeat reads?
How can I measure its accuracy? How can I validate whether an aligned location is right or not? How should I deal with chimeric reads, and how can I understand why an aligner is generating so many of them?
I'm sorry if I'm confusing you, but mainly I'm very curious to understand chimeric reads in the results of different aligners. How can we deal with chimeric reads when we don't want to drop them?
Thank you so much once again. I'm open for discussion anytime, and I really appreciate your help.
best regards.
Chimeric reads are an odd beast; there is no clear definition of what a chimera is.
definition of chimeric vs multiple-mapping (SAM)
The way I understand it, for an alignment to be "chimeric" it has to be a non-linear alignment (not just a skip, like an intron in the middle), it can only be formed when the sequence aligns to exactly two targets, and both alignments must match around the junction (though they may be clipped at the ends).
What is important is that "chimeric" does not actually mean the sequence was a chimera; it can just as well be an artifact of the alignment process.
I don't believe this is necessarily true. I tend to think of "supplementary reads" as "split alignments", and if you are aligning a long read or a contig against a reference genome, it can be split arbitrarily many times. The SA tag in the SAM/BAM/CRAM record also tells you where the other pieces of the split are (the SA tag is stored on all the pieces of the split).
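To make the SA tag point concrete, here is a minimal sketch in plain Python that splits an SA value into its pieces. The field order (rname, pos, strand, CIGAR, mapQ, NM, with entries separated by semicolons) follows the SAM specification; the function name, the `SplitPiece` type, and the example value are made up for illustration and do not come from any aligner output in this thread.

```python
from typing import List, NamedTuple


class SplitPiece(NamedTuple):
    """One entry of an SA tag (hypothetical helper type)."""
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    strand: str  # "+" or "-"
    cigar: str
    mapq: int
    nm: int      # edit distance


def parse_sa_tag(sa_value: str) -> List[SplitPiece]:
    """Parse the value of an SA:Z: tag (without the 'SA:Z:' prefix).

    Assumes reference names contain no ',' or ';'.
    """
    pieces = []
    for entry in sa_value.split(";"):
        if not entry:  # the value ends with ';', so the last item is empty
            continue
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        pieces.append(SplitPiece(rname, int(pos), strand, cigar, int(mapq), int(nm)))
    return pieces


# Invented example value, as it might appear on the primary record of a
# read that is split across three chromosomes.
example = "chr4,150200,-,30S30M40S,60,0;chrX,9000,+,70S30M,60,0;"
for piece in parse_sa_tag(example):
    print(piece.rname, piece.pos, piece.strand, piece.cigar)
```

Because every piece of the split carries the tag, running this on any one record lists where all the other pieces landed.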
See also this recent reddit thread: https://www.reddit.com/r/bioinformatics/comments/sdicpj/chimeric_reads/
Note that the term "chimeric" does not have a precise definition: it is whatever the designer of the algorithm considers "chimeric", and that definition may differ between tools.
I believe bwa mem will only produce chimeric reads over two sequences (I expect minimap2 to behave similarly, though I have not investigated).
All the other multi-split alignments would be called secondary alignments in my opinion, though I would like to settle that in my mind as well.
If I have a read that split-aligns to chr1, chr4, and chrX, then this produces one primary alignment record, and two supplementary alignments marked with the 2048 flag (not secondary!). The SA tag is written to all three of these records, and I can reconstitute the full SV event from just one part of the split alignment.
See SAMv1.pdf https://samtools.github.io/hts-specs/SAMv1.pdf and the "Chimeric alignment" definition. It says nothing about only two sequences, and says specifically that it is supplementary and not secondary.
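As a rough illustration of the primary / supplementary / secondary distinction, the sketch below classifies SAM records by their FLAG bits and groups them by read name. The bit values (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are from the SAM specification; the function names and the three example records are invented for this post and are not real aligner output.

```python
from collections import defaultdict

UNMAPPED = 0x4         # 4
SECONDARY = 0x100      # 256
SUPPLEMENTARY = 0x800  # 2048


def alignment_kind(flag: int) -> str:
    """Classify a SAM FLAG value (hypothetical helper)."""
    if flag & UNMAPPED:
        return "unmapped"
    if flag & SUPPLEMENTARY:  # checked first: a record can set both bits
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


def group_by_read(sam_lines):
    """Map each QNAME to a list of (kind, rname, pos) tuples.

    For paired-end data both mates share a QNAME; separating them would
    additionally need the 0x40 / 0x80 bits, which this sketch ignores.
    """
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        groups[qname].append((alignment_kind(flag), rname, pos))
    return dict(groups)


# Invented records for one read split across chr1, chr4 and chrX.
records = [
    "read1\t0\tchr1\t1000\t60\t40M60S\t*\t0\t0\t*\t*",
    "read1\t2064\tchr4\t150200\t60\t30H30M40H\t*\t0\t0\t*\t*",
    "read1\t2048\tchrX\t9000\t60\t70H30M\t*\t0\t0\t*\t*",
]
for kind, rname, pos in group_by_read(records)["read1"]:
    print(kind, rname, pos)
```

Tallying the kinds per read this way, separately for each aligner's output, is one simple way to see how many reads carry supplementary records rather than secondary ones.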
You are correct, all segments are indeed reported as such; I'm not sure why in my mind I always associated it with two fragments only. I'm going to post the test code I ran for reference.
The result is:
Indeed, two supplementary alignments are reported with corresponding SA tags.