I have four whole-genome sequencing samples from potato leaves that were run on a single lane of an Illumina HiSeq 4000 flow cell to generate 150 bp PE reads. I trimmed the reads with Trimmomatic before aligning them to the potato reference with bwa mem, using the command below:
bwa mem -t 18 -k 16 -M -R"@RG\tID:Lane3_R\tSM:Resistant\tPL:Illumina\tLB:Resistant" potato_dm_v404_all_pm_un.fasta Resistant_Filtered_2P.fastq Resistant_Filtered_1P.fastq | samtools view -Sub - | samtools sort -O BAM -o Resistant.sorted.bam
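Because bwa mem pairs the two FASTQ files record by record, it can be worth confirming before alignment that both trimmed files still list the same reads in the same order. The following is only a minimal sketch: the function names `read_names` and `pairs_in_sync` are made up for illustration, and it assumes uncompressed four-line FASTQ records whose names may end in `/1` or `/2`.

```shell
# Sketch: report whether two FASTQ files list the same read names in the same order.
# Assumes plain (uncompressed) FASTQ with exactly four lines per record.

# Print one read name per record: drop the leading "@", any comment after the
# first whitespace, and a trailing "/1" or "/2".
read_names() {
    awk 'NR % 4 == 1 { name = substr($1, 2); sub(/\/[12]$/, "", name); print name }' "$1"
}

# Print "in sync" and return 0 if both files carry identical name lists,
# otherwise print "out of sync" and return 1.
pairs_in_sync() {
    tmp1=$(mktemp) || return 2
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 2; }
    read_names "$1" > "$tmp1"
    read_names "$2" > "$tmp2"
    if cmp -s "$tmp1" "$tmp2"; then
        status=0
        echo "in sync"
    else
        status=1
        echo "out of sync"
    fi
    rm -f "$tmp1" "$tmp2"
    return "$status"
}

# Possible call with the files named in the command above:
# pairs_in_sync Resistant_Filtered_1P.fastq Resistant_Filtered_2P.fastq
```

The check only compares names, so it says nothing about which file holds read 1 and which holds read 2.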
I then ran CollectAlignmentSummaryMetrics:
java -Xmx20g -jar /opt/software/picardTools/1.113/CollectAlignmentSummaryMetrics.jar R=potato_dm_v404_all_pm_un.fasta INPUT=Resistant.sorted.bam OUTPUT=Resistant_algn_summary.txt
Although 98.4% of PF paired reads aligned to the reference, I noticed that for one of my samples FIRST_OF_PAIR has substantially more PF_HQ_ALIGNED_READS than TOTAL_READS, along with other peculiar values that may indicate misaligned PE reads. The results are below:
FIRST_OF_PAIR TOTAL_READS 70524781
FIRST_OF_PAIR PF_READS 70524781
FIRST_OF_PAIR PCT_PF_READS 1
FIRST_OF_PAIR PF_NOISE_READS 0
FIRST_OF_PAIR PF_READS_ALIGNED 69453601
FIRST_OF_PAIR PCT_PF_READS_ALIGNED 0.984811
FIRST_OF_PAIR PF_ALIGNED_BASES 9620894381
FIRST_OF_PAIR PF_HQ_ALIGNED_READS 50804548
FIRST_OF_PAIR PF_HQ_ALIGNED_BASES 7174609113
FIRST_OF_PAIR PF_HQ_ALIGNED_Q20_BASES 7105678130
FIRST_OF_PAIR PF_HQ_MEDIAN_MISMATCHES 2
FIRST_OF_PAIR PF_MISMATCH_RATE 0.035645
FIRST_OF_PAIR PF_HQ_ERROR_RATE 0.03284
FIRST_OF_PAIR PF_INDEL_RATE 0.002595
FIRST_OF_PAIR MEAN_READ_LENGTH 147.023646
FIRST_OF_PAIR READS_ALIGNED_IN_PAIRS 69198407
FIRST_OF_PAIR PCT_READS_ALIGNED_IN_PAIRS 0.996326
FIRST_OF_PAIR BAD_CYCLES 0
FIRST_OF_PAIR STRAND_BALANCE 0.500124
FIRST_OF_PAIR PCT_CHIMERAS 0.182996
FIRST_OF_PAIR PCT_ADAPTER 0.000006
SECOND_OF_PAIR TOTAL_READS 70524781
SECOND_OF_PAIR PF_READS 70524781
SECOND_OF_PAIR PCT_PF_READS 1
SECOND_OF_PAIR PF_NOISE_READS 0
SECOND_OF_PAIR PF_READS_ALIGNED 69469731
SECOND_OF_PAIR PCT_PF_READS_ALIGNED 0.98504
SECOND_OF_PAIR PF_ALIGNED_BASES 9752646754
SECOND_OF_PAIR PF_HQ_ALIGNED_READS 50858464
SECOND_OF_PAIR PF_HQ_ALIGNED_BASES 7273428558
SECOND_OF_PAIR PF_HQ_ALIGNED_Q20_BASES 7233606433
SECOND_OF_PAIR PF_HQ_MEDIAN_MISMATCHES 2
SECOND_OF_PAIR PF_MISMATCH_RATE 0.035133
SECOND_OF_PAIR PF_HQ_ERROR_RATE 0.032285
SECOND_OF_PAIR PF_INDEL_RATE 0.00263
SECOND_OF_PAIR MEAN_READ_LENGTH 148.983855
SECOND_OF_PAIR READS_ALIGNED_IN_PAIRS 69198407
SECOND_OF_PAIR PCT_READS_ALIGNED_IN_PAIRS 0.996094
SECOND_OF_PAIR BAD_CYCLES 0
SECOND_OF_PAIR STRAND_BALANCE 0.500204
SECOND_OF_PAIR PCT_CHIMERAS 0.182996
SECOND_OF_PAIR PCT_ADAPTER 0.000001
PAIR TOTAL_READS 141049562
PAIR PF_READS 141049562
PAIR PCT_PF_READS 1
PAIR PF_NOISE_READS 0
PAIR PF_READS_ALIGNED 138923332
PAIR PCT_PF_READS_ALIGNED 0.984926
PAIR PF_ALIGNED_BASES 19373541135
PAIR PF_HQ_ALIGNED_READS 101663012
PAIR PF_HQ_ALIGNED_BASES 14448037671
PAIR PF_HQ_ALIGNED_Q20_BASES 14339284563
PAIR PF_HQ_MEDIAN_MISMATCHES 2
PAIR PF_MISMATCH_RATE 0.035387
PAIR PF_HQ_ERROR_RATE 0.03256
PAIR PF_INDEL_RATE 0.002613
PAIR MEAN_READ_LENGTH 148.00375
PAIR READS_ALIGNED_IN_PAIRS 138396814
PAIR PCT_READS_ALIGNED_IN_PAIRS 0.99621
PAIR BAD_CYCLES 0
PAIR STRAND_BALANCE 0.500164
PAIR PCT_CHIMERAS 0.182996
PAIR PCT_ADAPTER 0.000003
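A comparison such as PF_HQ_ALIGNED_READS against TOTAL_READS can be checked mechanically rather than by eye. The sketch below is only an illustration: `check_hq_vs_total` is a made-up helper, and it assumes the listing above has been saved to a text file in the same three-column "CATEGORY METRIC VALUE" layout, which is not the layout Picard itself writes.

```shell
# Sketch: for each category, print PF_HQ_ALIGNED_READS, TOTAL_READS, their
# ratio, and whether the HQ count exceeds the total.
# Input: a text file with lines of the form "CATEGORY METRIC VALUE".
check_hq_vs_total() {
    awk '
        $2 == "TOTAL_READS" {
            # Remember the order in which categories first appear.
            if (!($1 in total)) order[++n] = $1
            total[$1] = $3
        }
        $2 == "PF_HQ_ALIGNED_READS" { hq[$1] = $3 }
        END {
            for (i = 1; i <= n; i++) {
                c = order[i]
                ratio = (total[c] > 0) ? hq[c] / total[c] : 0
                verdict = (hq[c] + 0 > total[c] + 0) ? "HQ exceeds total" : "ok"
                printf "%s\t%d\t%d\t%.4f\t%s\n", c, hq[c], total[c], ratio, verdict
            }
        }
    ' "$1"
}

# Possible call (the file name is a placeholder):
# check_hq_vs_total Resistant_metrics_listing.txt
```

Each output line is tab-separated: category, HQ aligned reads, total reads, ratio, verdict.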
When I reran the alignment summary metrics on the BAM with duplicates marked, these questionable values seemed to be rectified. The results are as follows:
FIRST_OF_PAIR TOTAL_READS 70524781
FIRST_OF_PAIR PF_READS 70524781
FIRST_OF_PAIR PCT_PF_READS 1
FIRST_OF_PAIR PF_NOISE_READS 0
FIRST_OF_PAIR PF_READS_ALIGNED 69453601
FIRST_OF_PAIR PCT_PF_READS_ALIGNED 0.984811
FIRST_OF_PAIR PF_ALIGNED_BASES 9620894381
FIRST_OF_PAIR PF_HQ_ALIGNED_READS 50804548
FIRST_OF_PAIR PF_HQ_ALIGNED_BASES 7174609113
FIRST_OF_PAIR PF_HQ_ALIGNED_Q20_BASES 7105678130
FIRST_OF_PAIR PF_HQ_MEDIAN_MISMATCHES 2
FIRST_OF_PAIR PF_MISMATCH_RATE 0.035645
FIRST_OF_PAIR PF_HQ_ERROR_RATE 0.03284
FIRST_OF_PAIR PF_INDEL_RATE 0.002595
FIRST_OF_PAIR MEAN_READ_LENGTH 147.023646
FIRST_OF_PAIR READS_ALIGNED_IN_PAIRS 69198407
FIRST_OF_PAIR PCT_READS_ALIGNED_IN_PAIRS 0.996326
FIRST_OF_PAIR BAD_CYCLES 0
FIRST_OF_PAIR STRAND_BALANCE 0.500124
FIRST_OF_PAIR PCT_CHIMERAS 0.182996
FIRST_OF_PAIR PCT_ADAPTER 0.000006
SECOND_OF_PAIR TOTAL_READS 70524781
SECOND_OF_PAIR PF_READS 70524781
SECOND_OF_PAIR PCT_PF_READS 1
SECOND_OF_PAIR PF_NOISE_READS 0
SECOND_OF_PAIR PF_READS_ALIGNED 69469731
SECOND_OF_PAIR PCT_PF_READS_ALIGNED 0.98504
SECOND_OF_PAIR PF_ALIGNED_BASES 9752646754
SECOND_OF_PAIR PF_HQ_ALIGNED_READS 50858464
SECOND_OF_PAIR PF_HQ_ALIGNED_BASES 7273428558
SECOND_OF_PAIR PF_HQ_ALIGNED_Q20_BASES 7233606433
SECOND_OF_PAIR PF_HQ_MEDIAN_MISMATCHES 2
SECOND_OF_PAIR PF_MISMATCH_RATE 0.035133
SECOND_OF_PAIR PF_HQ_ERROR_RATE 0.032285
SECOND_OF_PAIR PF_INDEL_RATE 0.00263
SECOND_OF_PAIR MEAN_READ_LENGTH 148.983855
SECOND_OF_PAIR READS_ALIGNED_IN_PAIRS 69198407
SECOND_OF_PAIR PCT_READS_ALIGNED_IN_PAIRS 0.996094
SECOND_OF_PAIR BAD_CYCLES 0
SECOND_OF_PAIR STRAND_BALANCE 0.500204
SECOND_OF_PAIR PCT_CHIMERAS 0.182996
SECOND_OF_PAIR PCT_ADAPTER 0.000001
PAIR TOTAL_READS 141049562
PAIR PF_READS 141049562
PAIR PCT_PF_READS 1
PAIR PF_NOISE_READS 0
PAIR PF_READS_ALIGNED 138923332
PAIR PCT_PF_READS_ALIGNED 0.984926
PAIR PF_ALIGNED_BASES 19373541135
PAIR PF_HQ_ALIGNED_READS 101663012
PAIR PF_HQ_ALIGNED_BASES 14448037671
PAIR PF_HQ_ALIGNED_Q20_BASES 14339284563
PAIR PF_HQ_MEDIAN_MISMATCHES 2
PAIR PF_MISMATCH_RATE 0.035387
PAIR PF_HQ_ERROR_RATE 0.03256
PAIR PF_INDEL_RATE 0.002613
PAIR MEAN_READ_LENGTH 148.00375
PAIR READS_ALIGNED_IN_PAIRS 138396814
PAIR PCT_READS_ALIGNED_IN_PAIRS 0.99621
PAIR BAD_CYCLES 0
PAIR STRAND_BALANCE 0.500164
PAIR PCT_CHIMERAS 0.182996
PAIR PCT_ADAPTER 0.000003
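To see exactly which values changed after duplicate marking, the two listings can be compared metric by metric. This is a rough sketch under the same assumption as the listings in this post (three columns, "CATEGORY METRIC VALUE", each listing saved to its own text file); `changed_metrics` is a hypothetical helper name.

```shell
# Sketch: print every metric whose value differs between two listings.
# Both inputs are text files with lines of the form "CATEGORY METRIC VALUE".
changed_metrics() {
    awk '
        # First file: store each value under "CATEGORY METRIC".
        NR == FNR { before[$1 " " $2] = $3; next }
        # Second file: report metrics present in both files with a different value.
        ($1 " " $2) in before && before[$1 " " $2] != $3 {
            printf "%s %s: %s -> %s\n", $1, $2, before[$1 " " $2], $3
            changed++
        }
        END { if (!changed) print "no differences" }
    ' "$1" "$2"
}

# Possible call (file names are placeholders):
# changed_metrics metrics_sorted.txt metrics_markdup.txt
```

If the function prints "no differences", the two runs reported the same numbers for every metric they share.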
Any suggestions on what might be amiss would be greatly appreciated.
If you trimmed the paired-end read files independently, it is possible that the reads went out of sync between the two files. You can use repair.sh from the BBMap suite to fix that, or redo the trimming with both files in the same run.
I trimmed both reads together using