Hi everybody! I'm new to bioinformatics and to this forum. I have Illumina paired-end whole genome sequencing data (read length is 150bp) aligned with bwa. There are regions where all reads are mapped with zero quality. I took all reads covering a specific coordinate from such region:
$ samtools view algn_srt.bam "chr21:43063074-43063074" | cut -f 1 | sort | uniq -c
This results in 28 reads:
1 V300033420L1C001R0181033041
1 V300033420L1C001R0300317239
1 V300033420L1C001R0470415198
1 V300033420L1C003R0011383109
1 V300033420L1C003R0081312154
2 V300033420L1C003R0130543193
1 V300033420L1C003R0140107287
1 V300033420L1C003R0170739418
1 V300033420L1C003R0230130936
1 V300033420L1C003R0231300468
1 V300033420L1C003R0300977382
3 V300033420L1C003R0371074884
2 V300033420L1C003R0371074885
1 V300033420L1C003R0510100518
1 V300033420L1C003R0671044234
1 V300033420L1C004R0210447304
1 V300033420L1C004R0371389897
1 V300033420L1C004R0380756132
1 V300033420L1C004R0550261504
1 V300033420L1C005R0090362628
2 V300033420L1C005R0090658320
1 V300033420L1C005R0160918804
1 V300033420L1C005R0330628576
1 V300033420L1C005R0391023819
1 V300033420L1C006R0021214012
1 V300033420L1C006R0060892638
1 V300033420L1C006R0060892639
1 V300033420L1C006R0161090838
Most reads cross the specified coordinate exactly once, but three reads cross it 2 times (I guess that's an indication of a possible deletion, because paired reads are displayed with their common base name), and there is one read which crosses the coordinate 3 times (how is this possible btw?).
Most reads aligned with CIGAR string 150M
:
$ samtools view algn_srt.bam "chr21:43063074-43063074" | cut -f 6 | grep 150M | wc -l
20
Next I searched for other places where the reads are mapped:
$ samtools view algn_srt.bam | grep -f read_names.txt >> mq_0.sam
$ samtools view mq_0.sam | cut -f 1 | sort | uniq -c
2 V300033420L1C001R0181033041
2 V300033420L1C001R0300317239
3 V300033420L1C001R0470415198
2 V300033420L1C003R0011383109
2 V300033420L1C003R0081312154
2 V300033420L1C003R0130543193
2 V300033420L1C003R0140107287
2 V300033420L1C003R0170739418
2 V300033420L1C003R0230130936
2 V300033420L1C003R0231300468
2 V300033420L1C003R0300977382
3 V300033420L1C003R0371074884
3 V300033420L1C003R0371074885
2 V300033420L1C003R0510100518
2 V300033420L1C003R0671044234
2 V300033420L1C004R0210447304
2 V300033420L1C004R0371389897
2 V300033420L1C004R0380756132
2 V300033420L1C004R0550261504
2 V300033420L1C005R0090362628
3 V300033420L1C005R0090658320
2 V300033420L1C005R0160918804
2 V300033420L1C005R0330628576
2 V300033420L1C005R0391023819
2 V300033420L1C006R0021214012
2 V300033420L1C006R0060892638
2 V300033420L1C006R0060892639
2 V300033420L1C006R0161090838
Most reads are encountered two times, that's because their mates are displayed with the same name. Only two reads map to a distant place:
V300033420L1C003R0371074885 339 chr21 6454741 0 82H68M = 43062949 36608142 GATCCACCCCAGTGATCTGCAGAGGGCGCGGCTTCAGGGCTCAAGGCCAGCAAAAGCCCCGCCTGGAC GFFFGFFCDFGGEFFFFFFGEFGGGF>FFGFFFFFEGFGG>FGFFFFFFFFFFFFFFDFFCGFCFFGF SA:Z:chr21,43063049,-,81M69S,0,2; XA:Z:chr21,-43063063,82S68M,1; MC:Z:150M MD:Z:11A56 NM:i:1 AS:i:63 XS:i:63
V300033420L1C001R0470415198 81 chr21 6454910 0 150M = 43063063 36608005 GGATGAGAAACCCAAGAACCCAGGCCCCAAGTATTAAAGCGGTAATGACCTAAATGTTCAGTGCAGGGCGCTCAGAAGTAAAGGCAGGTGCCTGGGAGGAACTGACCTCCCCGGGGTGCTTCTCGGAGCGGGGCCCCCAGGACAGTCCAG E=FEFFGFDCFDGFDGF.GFGFG0F'FGGFGEFFFGFBAFEG6FEFFCFFFF?GFFF3GFFFGEFGFFGDEDFGFGGGGGFGGGGFFFGFGFFAFFFFFGGGGGFFFFFFFGGGFFGGF;FGFFGFFGFFGFFFFFFEFF
So the question is: why mapping quality of all these reads is zero, when only two of them have another location where they are mapped?
Thanks!
Can you also post the parameters used with
bwa
for the original alignments?Bwa was piped after picard SamToFastq:
Look at the SAM flags (the second field from a SAM record), it could shed some light on this issue:
The output: