Without bed file (only used hg38.fa). Results:

Question

why does qualimap2 shows different results for number of mismatches for bowtie2 and BWAMEM?

0

Entering edit mode

5.7 years ago

maria2019 ▴ 250

Hi, I am very new to NGS analysis. I want to do WES analysis on a normal human embryonic stem cell line. I have aligned the paired-end reads (2* 150) on the the hg38 with bowtie2 and bwamem. after the alignment and then realignment around the indel, I checked the quality with the qualimap2 and I got VERY different result for mismatches, insertion, and indels.

Mismatches and indels (inside of regions)

Bowtie2 results: General error rate 1.11% Mismatches 35,875,622 Insertions 6,729,119 Mapped reads with at least one insertion 6.51% Deletions 812,397 Mapped reads with at least one deletion1.26% Homopolymer indels 16.81%

BWAmem result: General error rate 0.27% Mismatches 20,648,892 Insertions 276,090 Mapped reads with at least one insertion 0.33% Deletions 332,003 Mapped reads with at least one deletion 0.4% Homopolymer indels 59.3%

My questions are: 1. This is a normal cell line (not cancerous)! Should I expect this many mismatches? 2. Should I expect such a difference between the bowtie2 and bwa mem results?

WES qualimap bowtie2 bwamem • 2.4k views

ADD COMMENT • link updated 5.7 years ago by h.mon 35k • written 5.7 years ago by maria2019 ▴ 250

0

Entering edit mode

Changed to Question rather than Page or Tool, please leave it like that.

ADD REPLY • link 5.7 years ago by ATpoint 85k

0

Entering edit mode

You didn't say what were the commands used to map the reads (Bowtie2 in particular allows a lot of customization), nor the overall mappings metrics (number of mapped reads, uniquely and multiply mapped reads, and so on). You also didn't say which sequencing technology did you use. Finally, if you are doing quality control of exome sequencing, it is very helpful to pass to Qualimap2 a bed file with the probes capture regions, so it can provide inside- and outside-probe mapping metrics.

ADD REPLY • link 5.7 years ago by h.mon 35k

0

Entering edit mode

Hi, thanks for your response. Below are the commands I used:

Trimming with cutadapt reports: Encoding: Sanger . Illumina 1.9, Tot sequences: 56558516, GC%: 49%
alignment:

2.1. bwamem:

$ bwa index genome.fasta genome

$ bwa mem -M -t 70 hg38.fa trimmed.R1.fastq trimmed.R2.fastq | samtools sort -@70 -o WES_bwamem_sorted.bam

$ samtools view WES_bwamem_sorted.bam |wc -l ---> result =113413327

2.2. bowtie2

$ bowtie2-build genome.fa hg38

$ ./bowtie2 -t -q -p 70 -x ~/indexes/h38 -1 trimmed.R1.fastq -S trimmed.R2.fastq -S WES_bowtie2.sam

$ samtools sort WES_bowtie2.sam -o WES_bowtie2_sorted.bam (samtools 1.9)

$ samtools view WES_bowti2_sorted.bam |wc -l --> result =113116752

The rest is the same for both:

Add tags with picard: $ java -jar picard.jar AddOrReplaceReadGroups I=WES_bwamem_sorted.bam O=WES_bwamem_sorted_rep.bam RGID=HT52.6 RGLB=lib1 RGPL=illumina RGPU=WBBXX.6 RGSM=K00281 VALIDATION_STRINGENCY=LENIENT
Remove duplicates with picard: $ java -jar picard.jar MarkDuplicates INPUT=WES_bwamem_sorted_rep.bam OUTPUT=dupmarked.bam METRICS_FILE=dupmarkmetrics.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT
Index bam file with samtools: $ samtools index dupmarked.bam dupmarked.bam.bai
Locate indels (GATK 3.8): $ java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg38.fa -I dupmarked.bam -o out.intervals
Realign around indels: ( GATK 3.8): $ java -jar GenomeAnalysisTK.jar --filter_bases_not_stored -T IndelRealigner -R hg38.fa -I dupmarked.bam -known Homo_sapiens_assembly38.known_indels.vcf -targetIntervals out.intervals -o realigned.bam
Base quality score recalibration: $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R .hg38.fa -I realigned.bam -knownSites latest_dbsnp.vcf -o recal_data.grp
Create recalibrated bam : $ java -jar GenomeAnalysisTK.jar -T PrintReads -R hg38.fa -I realigned.bam -BQSR recal_data.grp -o recal.bam
Qualimap: Window #: 400, Size of a homopolymer: 3, Skip duplicate alignments: No, Analyze overlapping paired-end Reads: No, Draw chromosome limits: Yes

ADD REPLY • link 5.7 years ago by maria2019 ▴ 250

0

Entering edit mode

I used qualimap 2 times. Once I only used hg38 without a bed file. Then I used a bed file that I got from genome browser table: (http://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=Back) and chose exons option for the output as a bed file. Is that what you mean? Or I will have to create my own bed file?

Results:

My Reference size was: 3,099,922,541

Without bed file (only used hg38.fa). Results:

BWA mem

Mapped reads 81,480,705 / 99.83%

Unmapped reads 135,081 / 0.17%

Mapped paired reads 81,480,705 / 99.83%

Mapped reads, first in pair 40,751,388 / 49.93%

Mapped reads, second in pair 40,729,317 / 49.9%

Mapped reads, both in pair 81,434,763 / 99.78%

Mapped reads, singletons 45,942 / 0.06%

Read min/max/mean length 1 / 151 / 146.51

Duplicated reads (estimated) 33,016,104 / 40.45%

Duplication rate 38.84%

Clipped reads 12,393,533 / 15.19%

Coverage: Mean 3.7278 , Standard Deviation 23.8264

Mapping Quality: Mean Mapping Quality: 44.63

Insert size: Mean: 36,360.19 , Standard Deviation: 2,031,289.85 , P25/Median/P75: 146 / 188 / 241

Mismatches and indels : General error rate 0.25% Mismatches 27,060,341 Insertions 432,950 Mapped reads with at least one insertion 0.51% Deletions 486,236 Mapped reads with at least one deletion 0.58% Homopolymer indels 57.47%

Bowtie2

Mapped reads 74,496,919 / 91.99%

Unmapped reads 6,489,224 / 8.01%

Mapped paired reads 74,496,789 / 91.99%

Mapped reads, first in pair 38,656,666 / 47.73%

Mapped reads, second in pair 35,840,123 / 44.25%

Mapped reads, both in pair 71,519,982 / 88.31%

Mapped reads, singletons 2,976,807 / 3.68%

Read min/max/mean length 1 / 151 / 147.89

Duplicated reads (estimated) 26,008,479 / 32.11%

Duplication rate 32.97%

Clipped reads 0 / 0%

Coverage: Mean 3.5345, Standard Deviation 21.7436

Mapping Quality: Mean Mapping Quality 27.72

Insert size: Mean 12,726.21, Standard Deviation 1,057,876.32, P25/Median/P75 161 / 199 / 249

Mismatches and indels General error rate 0.91% Mismatches 55,590,848 Insertions 9,694,304 Mapped reads with at least one insertion 7.49% Deletions 1,235,926 Mapped reads with at least one deletion 1.52% Homopolymer indels 17.53%

With bed file

(NUMEBR OF MISMATCHES IS IN THE QUESTION)

BWA mem

Warnings: Some regions are not loaded, 114573 regions were skipped because chromosome name was not found in the BAM file.

Mapped reads 81,480,705 / 99.83%

Unmapped reads 135,081 / 0.17%

Mapped paired reads 81,480,705 / 99.83%

Mapped reads, first in pair 40,751,388 / 49.93%

Mapped reads, second in pair 40,729,317 / 49.9%

Mapped reads, both in pair 81,434,763 / 99.78%

Mapped reads, singletons 45,942 / 0.06%

Read min/max/mean length 1 / 151 / 146.51

Clipped reads 10,174,547 / 12.47%

2.3. Globals (inside of regions):

Regions size/percentage of reference 135,021,462 / 4.36%

Mapped reads 69,574,272 / 85.25%

Mapped reads, only first in pair 34,828,403 / 42.67%

Mapped reads, only second in pair 34,745,869 / 42.57%

Mapped reads, both in pair 69,553,031 / 85.22%

Mapped reads, singletons 21,241 / 0.03%

Correct strand reads 0 / 0%

Clipped reads 10,174,547 / 12.47%

Duplicated reads (estimated) 30,358,081 / 43.63%

Duplication rate 43.23%

Coverage (inside of regions), Mean 58.958, Standard Deviation 82.5376

Mapping Quality (inside of regions): Mean Mapping Quality: 58.09

Insert size (inside of regions): Mean 11,841.97, Standard Deviation 1,061,230.03, P25/Median/P75: 146 / 186 / 237

Bowtie2

Warnings Some regions are not loaded 114573 regions were skipped because chromosome name was not found in the BAM file.

reads 64,103,566 Mapped reads 59,192,134 / 92.34%

Unmapped reads 4,911,432 / 7.66%

Mapped paired reads 59,192,037 / 92.34%

Mapped reads, first in pair 30,715,233 / 47.92%

Mapped reads, second in pair 28,476,804 / 44.42%

Mapped reads, both in pair 56,830,127 / 88.65%

Mapped reads, singletons 2,361,910 / 3.68%

Read min/max/mean length 1 / 151 / 147.89

Clipped reads 0 / 0%

2.3. Globals (inside of regions)

Regions size/percentage of reference 135,021,462 / 4.36%

Mapped reads 50,853,962 / 79.33%

Mapped reads, only first in pair 26,307,095 / 41.04%

Mapped reads, only second in pair 24,546,782 / 38.29%

Mapped reads, both in pair 49,046,814 / 76.51%

Mapped reads, singletons 1,807,063 / 2.82%

Correct strand reads 0 / 0%

Clipped reads 0 / 0%

Duplicated reads (estimated) 19,437,562 / 38.22%

Duplication rate 37.57%

2.5. Coverage (inside of regions): Mean 44.4131, Standard Deviation 72.5277

2.6. Mapping Quality (inside of regions): Mean Mapping Quality: 37.84

2.7. Insert size (inside of regions)

Mean 8,897.63

Standard Deviation 858,716.96

P25/Median/P75 160 / 196 / 245

ADD REPLY • link 5.7 years ago by maria2019 ▴ 250

score 0 · Answer 1 · 2019-03-27

This is a normal cell line (not cancerous)! Should I expect this many mismatches?

You didn't perform "proper" variant calling, you are just looking at all variants, be them true variants or sequencing errors. For an expected range of true variants numbers, see the answers at What is the expected rate of SNPs within a human exome sequencing with HiSeq? .

Should I expect such a difference between the bowtie2 and bwa mem results?

By default, bowtie2 performs end-to-end alignment, while bwa performs local alignment. This means bowtie2 aligns the whole read, even if some regions don't align well, while bwa will soft-clip the extremities of the read, if this increases the alignment score. You can see this clearly at the clipped reads metrics: bowtie2 has 0%, while bwa has a sizeable proportion of clipped reads. I believe the differences in mismatches, insertions and deletions are mostly due to this, and I think the difference would be a lot smaller if you were to perform variant calling and apply quality filters after variant calling.

The mapping rate and mapping rate inside regions are better for bwa, and for this reason I think you should stay with bwa alignment.

I used a bed file that I got from genome browser table: (http://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=Back) and chose exons option for the output as a bed file. Is that what you mean?

No, the manufacturer of the exome kit probably has released a bed file with the coordinates of the capture probes, I was suggesting you use this bed.