Hello all.
I am trying to align some WGS reads to a reference genome using bwa-mem and bowtie2. The goal is to align 50 samples, however to do some checks I have selected only one sample and am running some tests on it. After performing all the quality control steps and checking that the sequences seem to be OK, I have aligned them with both bwa-mem and bowtie2 using the following commands:
# FOR BWA-MEM:
# Build the index for the reference
bwa index $ref
# Align reads to reference
bwa mem -t 10 $ref $r1.fq.gz $r2.fq.gz | samtools sort -@ 10 -m 4G > bwa.bam
# FOR BOWTIE2:
# Build the index for the reference
bowtie2-build $ref $ref
# Align reads to reference
bowtie2 -p 4 --no-unal -D 20 -R 3 -N 1 -L 20 -i S,1,0.5 --maxins 350 --minins 0 -x $ref -1 $r1.fq.gz -2 $r2.fq.gz | samtools sort -@ 4 -m 5G > bowtie.bam
The samtools flagstat report for each bam:
# BWA-mem
73066307 + 0 in total (QC-passed reads + QC-failed reads)
71520066 + 0 primary
0 + 0 secondary
1546241 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
72341933 + 0 mapped (99.01% : N/A)
70795692 + 0 primary mapped (98.99% : N/A)
71520066 + 0 paired in sequencing
35760033 + 0 read1
35760033 + 0 read2
61642364 + 0 properly paired (86.19% : N/A)
70543774 + 0 with itself and mate mapped
251918 + 0 singletons (0.35% : N/A)
7500328 + 0 with mate mapped to a different chr
4271286 + 0 with mate mapped to a different chr (mapQ>=5)
# Bowtie2
53628504 + 0 in total (QC-passed reads + QC-failed reads)
53628504 + 0 primary
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
53628504 + 0 mapped (100.00% : N/A)
53628504 + 0 primary mapped (100.00% : N/A)
53628504 + 0 paired in sequencing
26862051 + 0 read1
26766453 + 0 read2
28009618 + 0 properly paired (52.23% : N/A)
50436738 + 0 with itself and mate mapped
3191766 + 0 singletons (5.95% : N/A)
10168420 + 0 with mate mapped to a different chr
1949385 + 0 with mate mapped to a different chr (mapQ>=5)
After obtaining the alignments, I have compared them in IGV, however I get some very strange results. There are certain regions in the genome where nothing aligns and others where the coverage is excessive (up to 2000x coverage according to IGV in some regions).
Something seems to be wrong, especially because the sequencing coverage was 15x, so the total coverage should not be as higher as 2000x in some regions (or ~500 as shown in the first screenshot) Any idea what could be the cause of this behavior? Thanks in advance.
Thanks! I was suspecting this as well although I thought maybe there was some glitch in the alignment because in others I have done in Ecoli (with more coverage) I had not observed these results in the IGV. I guess this difference can be due to working with a bacterial model organism and a non-model plant.
I am neither a microbiology- nor a plant person but I would assume that simply due to genome size and annotation status the plant should have more unequal coverage because a) eukaryote genomes are more complex, b) that plant is probably not fully annotated and c) regions like telomers, centromers and repeat stretches complicate the matter.