Question

Understanding Supplementary reads.

2.8 years ago

kiran ▴ 10

Hello there,

thanks for your time,

I'm working with complete human genome data and focusing mainly on repeat regions. Surprisingly, many reads from repeats end up as supplementary (chimeric) alignments. This happens with both BWA and Dragen. Bowtie2, by contrast, drops the locations where supplementary alignments would occur, so most repeat reads are lost with Bowtie2, while BWA and Dragen align most of them, but mostly as chimeric alignments.

I want to spend enough time to understand how the aligners handle repeat reads; so far I have only looked at the overall counts from these three aligners. To pinpoint alignments at specific repeat locations, I want to choose the best aligner among these three, or any other aligner that gives good results. But what kind of information will tell me that a particular aligner is doing well on repeat reads?

How can I measure its accuracy? How can I validate whether an aligned location is right? How should I deal with chimeric reads, and how can I understand why an aligner generates so many of them?

I'm sorry if this is confusing; mainly I'm curious to understand chimeric reads in the results of different aligners. How can we deal with chimeric reads when we don't want to drop them?

Thank you so much once again. I'm open to discussion anytime, and I really appreciate your help.

best regards.

Bowtie2 MEM BWA Dragen • 3.9k views

ADD COMMENT • link updated 2.8 years ago by Istvan Albert 101k • written 2.8 years ago by kiran ▴ 10

chimeric reads are an odd beast; there is no "clear definition" of what a chimera is.

definition of chimeric vs multiple-mapping (SAM)

the way I understand it, for an alignment to be "chimeric" it has to be a non-linear alignment (not just a skip such as an intron in the middle), it can only be formed when the sequence aligns with exactly two targets, and both alignments must match around the junction (though they may be clipped at the ends)

what is important is that "chimeric" does not actually mean the sequence was a chimera; it can just as well be an artifact of the alignment process

ADD REPLY • link 2.8 years ago by Istvan Albert 101k

can only be formed when the sequence aligns with exactly two targets

I don't believe this is necessarily true. I tend to think of "supplementary reads" as "split alignments": if you are aligning a long read or a contig against a reference genome, it can be split arbitrarily many times. The SA tag in SAM/BAM/CRAM also tells you where the pieces of the split are (the SA tag is stored on all the pieces of the split).
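As a rough illustration of how the SA tag lists the other pieces of a split, here is a minimal shell sketch. It assumes the SA layout described in the SAM specification (semicolon-separated entries of rname,pos,strand,CIGAR,mapQ,NM). The function name parse_sa is made up for this example, and the sample value is the SA tag from the primary record in the test output posted later in this thread.

```shell
# Split an SA tag value into one tab-separated line per split piece.
# Columns: rname, pos, strand, CIGAR, mapQ, NM.
# parse_sa is a hypothetical helper name, not part of samtools or bwa.
parse_sa() {
    sa_value="${1#SA:Z:}"   # drop the "SA:Z:" prefix if it is present
    printf '%s\n' "$sa_value" | tr ';' '\n' |
        awk -F',' 'NF == 6 { print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 }'
}

# Example: the SA tag of a primary alignment that has two other pieces.
parse_sa 'SA:Z:chr3,1,+,200S100M,60,0;chr1,1,+,100M200S,60,0;'
```

Each output line is one other piece of the same read, so the number of lines plus one is the total number of pieces in the split.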

ADD REPLY • link 2.8 years ago by cmdcolin ★ 4.0k

see also this recent reddit thread https://www.reddit.com/r/bioinformatics/comments/sdicpj/chimeric_reads/

ADD REPLY • link 2.8 years ago by cmdcolin ★ 4.0k

the term "chimeric" does not have a precise definition: it is whatever the designer of the algorithm considers "chimeric", and that definition may differ.

I believe bwa mem will only produce chimeric reads over two sequences (I expect minimap2 to behave similarly as well, though I have not investigated)

all the other multi-split alignments would be called secondary alignments in my opinion, though I would like to settle that in my mind as well

ADD REPLY • link 2.8 years ago by Istvan Albert 101k

If I have a read that split-aligns to chr1, chr4, and chrX, then this produces one primary alignment record, and two supplementary alignments marked with the 2048 flag (not secondary!). The SA tag is written to all three of these records, and I can reconstitute the full SV event from just one part of the split alignment.
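To make the primary / supplementary / secondary distinction concrete, here is a small shell sketch that classifies a record from its FLAG value. It relies only on the bit values from the SAM specification (4 = unmapped, 256 = secondary, 2048 = supplementary). The function name classify_flag is invented for this example.

```shell
# Classify a SAM record from its FLAG field.
# Bits used (SAM spec): 0x4 = 4 unmapped, 0x100 = 256 secondary,
# 0x800 = 2048 supplementary. classify_flag is a hypothetical helper name.
classify_flag() {
    flag="$1"
    if [ $(( flag & 4 )) -ne 0 ]; then
        echo unmapped
    elif [ $(( flag & 2048 )) -ne 0 ]; then
        echo supplementary
    elif [ $(( flag & 256 )) -ne 0 ]; then
        echo secondary
    else
        echo primary
    fi
}

# 2064 = 2048 + 16, i.e. a supplementary alignment on the reverse strand.
for f in 0 16 256 2048 2064; do
    printf '%s\t%s\n' "$f" "$(classify_flag "$f")"
done
```

On a real file the same bits can be used as filters, for example samtools view -c -f 2048 to count supplementary records, or samtools view -F 2304 to keep only primary ones.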

See SAMv1.pdf https://samtools.github.io/hts-specs/SAMv1.pdf and the "Chimeric alignment" definition. It says nothing about only two sequences, and says specifically that it is supplementary and not secondary.

ADD REPLY • link 2.8 years ago by cmdcolin ★ 4.0k

You are correct, all segments are indeed reported as such; I'm not sure why in my mind I always associated it with two fragments only. I'm going to post the test code I ran for reference:

set -uex

pip install bio --upgrade

# Fetch a genome to build our chromosomes from.
bio fetch AF086833 -format fasta > AF086833.fa

# Make an artificial three-chromosome genome.
bio fasta AF086833.fa -e 1000 --rename chr1 > ref.fa
bio fasta AF086833.fa -s 2000 -e 3000 --rename chr2 >> ref.fa
bio fasta AF086833.fa -s 4000 -e 5000 --rename chr3 >> ref.fa

# Make the query (needs emboss union)
bio fasta ref.fa -end 100 | union --filter -sid seq > query.fa

# Build index
bwa index ref.fa

# Generate SAM file
bwa mem ref.fa query.fa > align.sam

# Print the relevant fields
samtools view align.sam | cut -f 2,3,16

the result is:

0       chr2    SA:Z:chr3,1,+,200S100M,60,0;chr1,1,+,100M200S,60,0;
2048    chr3    SA:Z:chr2,1,+,100S100M100S,60,0;chr1,1,+,100M200S,60,0;
2048    chr1    SA:Z:chr2,1,+,100S100M100S,60,0;chr3,1,+,200S100M,60,0;

indeed two supplementary alignments are reported with corresponding SA tags.
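One way to check that the three pieces really tile the 300 bp query is to turn each CIGAR string from the SA tags into the query interval it covers. The sketch below is only an illustration: the function name query_span is invented, it assumes forward-strand alignments (as in the output above), and it treats leading S or H clips as the offset into the query.

```shell
# Print the 0-based, half-open query interval covered by a CIGAR string.
# Assumes a forward-strand alignment; on the reverse strand the clips
# would refer to the opposite ends of the original read.
# query_span is a hypothetical helper name.
query_span() {
    printf '%s\n' "$1" | awk '{
        cigar = $0; start = 0; len = 0; seen = 0
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            n  = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            cigar = substr(cigar, RLENGTH + 1)
            if ((op == "S" || op == "H") && !seen) {
                start += n            # leading clip shifts the start
            } else if (op == "M" || op == "I" || op == "=" || op == "X") {
                len += n              # operations that consume query bases
                seen = 1
            }
        }
        print start "\t" (start + len)
    }'
}

# The three CIGAR strings from the SA tags above (chr1, chr2, chr3).
for c in 100M200S 100S100M100S 200S100M; do
    printf '%s\t%s\n' "$c" "$(query_span "$c")"
done
```

For these three CIGARs the intervals come out as 0-100, 100-200 and 200-300, so the pieces cover the query end to end without overlap.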

ADD REPLY • link 2.8 years ago by Istvan Albert 101k