Question

Strange alignments results

0

Entering edit mode

2.4 years ago

antmantras ▴ 80

Hello all.

I am trying to align some WGS reads to a reference genome using bwa-mem and bowtie2. The goal is to align 50 samples, however to do some checks I have selected only one sample and am running some tests on it. After performing all the quality control steps and checking that the sequences seem to be OK, I have aligned them with both bwa-mem and bowtie2 using the following commands:

# FOR BWA-MEM:

# Build the index for the reference

bwa index $ref

# Align reads to reference

bwa mem -t 10 $ref $r1.fq.gz $r2.fq.gz | samtools sort -@ 10 -m 4G > bwa.bam

# FOR BOWTIE2:

# Build the index for the reference

bowtie2-build $ref $ref

# Align reads to reference

bowtie2 -p 4 --no-unal -D 20 -R 3 -N 1 -L 20 -i S,1,0.5 --maxins 350 --minins 0 -x $ref -1 $r1.fq.gz -2 $r2.fq.gz | samtools sort -@ 4 -m 5G > bowtie.bam

The samtools flagstat report for each bam:

# BWA-mem
73066307 + 0 in total (QC-passed reads + QC-failed reads)
71520066 + 0 primary
0 + 0 secondary
1546241 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
72341933 + 0 mapped (99.01% : N/A)
70795692 + 0 primary mapped (98.99% : N/A)
71520066 + 0 paired in sequencing
35760033 + 0 read1
35760033 + 0 read2
61642364 + 0 properly paired (86.19% : N/A)
70543774 + 0 with itself and mate mapped
251918 + 0 singletons (0.35% : N/A)
7500328 + 0 with mate mapped to a different chr
4271286 + 0 with mate mapped to a different chr (mapQ>=5)

# Bowtie2
53628504 + 0 in total (QC-passed reads + QC-failed reads)
53628504 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
53628504 + 0 mapped (100.00% : N/A)
53628504 + 0 primary mapped (100.00% : N/A)
53628504 + 0 paired in sequencing
26862051 + 0 read1
26766453 + 0 read2
28009618 + 0 properly paired (52.23% : N/A)
50436738 + 0 with itself and mate mapped
3191766 + 0 singletons (5.95% : N/A)
10168420 + 0 with mate mapped to a different chr
1949385 + 0 with mate mapped to a different chr (mapQ>=5)

After obtaining the alignments, I have compared them in IGV, however I get some very strange results. There are certain regions in the genome where nothing aligns and others where the coverage is excessive (up to 2000x coverage according to IGV in some regions).

First screenshot Second screenshot

Something seems to be wrong, especially because the sequencing coverage was 15x, so the total coverage should not be as higher as 2000x in some regions (or ~500 as shown in the first screenshot) Any idea what could be the cause of this behavior? Thanks in advance.

coverage mem bwa bowtie2 alignment • 724 views

ADD COMMENT • link updated 2.4 years ago by ATpoint 85k • written 2.4 years ago by antmantras ▴ 80

score 5 · Accepted Answer · 2022-06-20

5

Entering edit mode

2.4 years ago

ATpoint 85k

This is normal and expected. Some regions in the genome are not or poorly mappable with short NGS reads while others will accumulate excessive read counts, e.g. due to low complexity, by this prone for false alignments. It is also expected that with these settings bwa aligns more reads than bowtie2 because bwa by default does local while bowtie2 does end-toend alignment. Make your life easy and use bwa mem because this is what everyone uses for WGS and it is what many variant callers and downstream tools expect. In a nutshell, the average genome coverage has gigantic standard errors, it is nowhere near a Gaussian.

ADD COMMENT • link 2.4 years ago by ATpoint 85k

0

Entering edit mode

Thanks! I was suspecting this as well although I thought maybe there was some glitch in the alignment because in others I have done in Ecoli (with more coverage) I had not observed these results in the IGV. I guess this difference can be due to working with a bacterial model organism and a non-model plant.

ADD REPLY • link 2.4 years ago by antmantras ▴ 80

0

Entering edit mode

I am neither a microbiology- nor a plant person but I would assume that simply due to genome size and annotation status the plant should have more unequal coverage because a) eukaryote genomes are more complex, b) that plant is probably not fully annotated and c) regions like telomers, centromers and repeat stretches complicate the matter.

ADD REPLY • link 2.4 years ago by ATpoint 85k