Hi, I am very new to NGS analysis. I want to do WES analysis on a normal human embryonic stem cell line. I have aligned the paired-end reads (2* 150) on the the hg38 with bowtie2 and bwamem. after the alignment and then realignment around the indel, I checked the quality with the qualimap2 and I got VERY different result for mismatches, insertion, and indels.
Mismatches and indels (inside of regions)
Bowtie2 results: General error rate 1.11% Mismatches 35,875,622 Insertions 6,729,119 Mapped reads with at least one insertion 6.51% Deletions 812,397 Mapped reads with at least one deletion1.26% Homopolymer indels 16.81%
BWAmem result: General error rate 0.27% Mismatches 20,648,892 Insertions 276,090 Mapped reads with at least one insertion 0.33% Deletions 332,003 Mapped reads with at least one deletion 0.4% Homopolymer indels 59.3%
My questions are: 1. This is a normal cell line (not cancerous)! Should I expect this many mismatches? 2. Should I expect such a difference between the bowtie2 and bwa mem results?
Changed to
Question
rather thanPage
orTool
, please leave it like that.You didn't say what were the commands used to map the reads (Bowtie2 in particular allows a lot of customization), nor the overall mappings metrics (number of mapped reads, uniquely and multiply mapped reads, and so on). You also didn't say which sequencing technology did you use. Finally, if you are doing quality control of exome sequencing, it is very helpful to pass to Qualimap2 a bed file with the probes capture regions, so it can provide inside- and outside-probe mapping metrics.
Hi, thanks for your response. Below are the commands I used:
2.1. bwamem:
$ bwa index genome.fasta genome
$ bwa mem -M -t 70 hg38.fa trimmed.R1.fastq trimmed.R2.fastq | samtools sort -@70 -o WES_bwamem_sorted.bam
$ samtools view WES_bwamem_sorted.bam |wc -l ---> result =113413327
2.2. bowtie2
$ bowtie2-build genome.fa hg38
$ ./bowtie2 -t -q -p 70 -x ~/indexes/h38 -1 trimmed.R1.fastq -S trimmed.R2.fastq -S WES_bowtie2.sam
$ samtools sort WES_bowtie2.sam -o WES_bowtie2_sorted.bam (samtools 1.9)
$ samtools view WES_bowti2_sorted.bam |wc -l --> result =113116752
The rest is the same for both:
Add tags with picard: $ java -jar picard.jar AddOrReplaceReadGroups I=WES_bwamem_sorted.bam O=WES_bwamem_sorted_rep.bam RGID=HT52.6 RGLB=lib1 RGPL=illumina RGPU=WBBXX.6 RGSM=K00281 VALIDATION_STRINGENCY=LENIENT
Remove duplicates with picard: $ java -jar picard.jar MarkDuplicates INPUT=WES_bwamem_sorted_rep.bam OUTPUT=dupmarked.bam METRICS_FILE=dupmarkmetrics.txt REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT
Index bam file with samtools: $ samtools index dupmarked.bam dupmarked.bam.bai
Locate indels (GATK 3.8): $ java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R hg38.fa -I dupmarked.bam -o out.intervals
Realign around indels: ( GATK 3.8): $ java -jar GenomeAnalysisTK.jar --filter_bases_not_stored -T IndelRealigner -R hg38.fa -I dupmarked.bam -known Homo_sapiens_assembly38.known_indels.vcf -targetIntervals out.intervals -o realigned.bam
Base quality score recalibration: $ java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R .hg38.fa -I realigned.bam -knownSites latest_dbsnp.vcf -o recal_data.grp
Create recalibrated bam : $ java -jar GenomeAnalysisTK.jar -T PrintReads -R hg38.fa -I realigned.bam -BQSR recal_data.grp -o recal.bam
Qualimap: Window #: 400, Size of a homopolymer: 3, Skip duplicate alignments: No, Analyze overlapping paired-end Reads: No, Draw chromosome limits: Yes
I used qualimap 2 times. Once I only used hg38 without a bed file. Then I used a bed file that I got from genome browser table: (http://genome.ucsc.edu/cgi-bin/hgTables?hgta_doMainPage=Back) and chose exons option for the output as a bed file. Is that what you mean? Or I will have to create my own bed file?
Results:
My Reference size was: 3,099,922,541
Without bed file (only used hg38.fa). Results:
BWA mem
reads 81,615,786
Mapped reads 81,480,705 / 99.83%
Unmapped reads 135,081 / 0.17%
Mapped paired reads 81,480,705 / 99.83%
Mapped reads, first in pair 40,751,388 / 49.93%
Mapped reads, second in pair 40,729,317 / 49.9%
Mapped reads, both in pair 81,434,763 / 99.78%
Mapped reads, singletons 45,942 / 0.06%
Read min/max/mean length 1 / 151 / 146.51
Duplicated reads (estimated) 33,016,104 / 40.45%
Duplication rate 38.84%
Clipped reads 12,393,533 / 15.19%
Coverage: Mean 3.7278 , Standard Deviation 23.8264
Mapping Quality: Mean Mapping Quality: 44.63
Insert size: Mean: 36,360.19 , Standard Deviation: 2,031,289.85 , P25/Median/P75: 146 / 188 / 241
Mismatches and indels : General error rate 0.25% Mismatches 27,060,341 Insertions 432,950 Mapped reads with at least one insertion 0.51% Deletions 486,236 Mapped reads with at least one deletion 0.58% Homopolymer indels 57.47%
Bowtie2
reads 80,986,143
Mapped reads 74,496,919 / 91.99%
Unmapped reads 6,489,224 / 8.01%
Mapped paired reads 74,496,789 / 91.99%
Mapped reads, first in pair 38,656,666 / 47.73%
Mapped reads, second in pair 35,840,123 / 44.25%
Mapped reads, both in pair 71,519,982 / 88.31%
Mapped reads, singletons 2,976,807 / 3.68%
Read min/max/mean length 1 / 151 / 147.89
Duplicated reads (estimated) 26,008,479 / 32.11%
Duplication rate 32.97%
Clipped reads 0 / 0%
Coverage: Mean 3.5345, Standard Deviation 21.7436
Mapping Quality: Mean Mapping Quality 27.72
Insert size: Mean 12,726.21, Standard Deviation 1,057,876.32, P25/Median/P75 161 / 199 / 249
Mismatches and indels General error rate 0.91% Mismatches 55,590,848 Insertions 9,694,304 Mapped reads with at least one insertion 7.49% Deletions 1,235,926 Mapped reads with at least one deletion 1.52% Homopolymer indels 17.53%
With bed file
(NUMEBR OF MISMATCHES IS IN THE QUESTION)
BWA mem
Warnings: Some regions are not loaded, 114573 regions were skipped because chromosome name was not found in the BAM file.
reads 81,615,786
Mapped reads 81,480,705 / 99.83%
Unmapped reads 135,081 / 0.17%
Mapped paired reads 81,480,705 / 99.83%
Mapped reads, first in pair 40,751,388 / 49.93%
Mapped reads, second in pair 40,729,317 / 49.9%
Mapped reads, both in pair 81,434,763 / 99.78%
Mapped reads, singletons 45,942 / 0.06%
Read min/max/mean length 1 / 151 / 146.51
Clipped reads 10,174,547 / 12.47%
2.3. Globals (inside of regions):
Regions size/percentage of reference 135,021,462 / 4.36%
Mapped reads 69,574,272 / 85.25%
Mapped reads, only first in pair 34,828,403 / 42.67%
Mapped reads, only second in pair 34,745,869 / 42.57%
Mapped reads, both in pair 69,553,031 / 85.22%
Mapped reads, singletons 21,241 / 0.03%
Correct strand reads 0 / 0%
Clipped reads 10,174,547 / 12.47%
Duplicated reads (estimated) 30,358,081 / 43.63%
Duplication rate 43.23%
Coverage (inside of regions), Mean 58.958, Standard Deviation 82.5376
Mapping Quality (inside of regions): Mean Mapping Quality: 58.09
Insert size (inside of regions): Mean 11,841.97, Standard Deviation 1,061,230.03, P25/Median/P75: 146 / 186 / 237
Bowtie2
Warnings Some regions are not loaded 114573 regions were skipped because chromosome name was not found in the BAM file.
reads 64,103,566 Mapped reads 59,192,134 / 92.34%
Unmapped reads 4,911,432 / 7.66%
Mapped paired reads 59,192,037 / 92.34%
Mapped reads, first in pair 30,715,233 / 47.92%
Mapped reads, second in pair 28,476,804 / 44.42%
Mapped reads, both in pair 56,830,127 / 88.65%
Mapped reads, singletons 2,361,910 / 3.68%
Read min/max/mean length 1 / 151 / 147.89
Clipped reads 0 / 0%
2.3. Globals (inside of regions)
Regions size/percentage of reference 135,021,462 / 4.36%
Mapped reads 50,853,962 / 79.33%
Mapped reads, only first in pair 26,307,095 / 41.04%
Mapped reads, only second in pair 24,546,782 / 38.29%
Mapped reads, both in pair 49,046,814 / 76.51%
Mapped reads, singletons 1,807,063 / 2.82%
Correct strand reads 0 / 0%
Clipped reads 0 / 0%
Duplicated reads (estimated) 19,437,562 / 38.22%
Duplication rate 37.57%
2.5. Coverage (inside of regions): Mean 44.4131, Standard Deviation 72.5277
2.6. Mapping Quality (inside of regions): Mean Mapping Quality: 37.84
2.7. Insert size (inside of regions)
Mean 8,897.63
Standard Deviation 858,716.96
P25/Median/P75 160 / 196 / 245