Question

low mapping rate with bowtie2 and BWA, but not blast

0

Entering edit mode

5.9 years ago

zhangdengwei ▴ 210

Dear all, I'd like to align the WGS data derived from PE50 to the human genome with BWA or Bowtie2, but the overall mapping rate was as low as 40%. Below is my command.

nohup bwa mem -t 48 GRCh38.bwa L01_5-8_1.fq.gz L01_5-8_2.fq.gz -o align.pe.sam &
nohup time bowtie2 -k 1 -p 36 -x GRCh38.p12 -1 L01_5-8_1.fq.gz -2 L01_5-8_2.fq.gz -S align.pe.bowtie2.sam &

Then I extracted out the unmapped reads and wonder the reason why they could not be aligned to the reference, so I utilized blastn to double check what the unmapped reads exactly were. Interesting, the result generated by blastn revealed all most unmapped reads were from the human reference, indeed. The following is my command of blastn and part of the result, the 4th column of which is mapped length.

nohup blastn -query unmapped_reads.fa -strand both -db GRCh38 -out out.outfmt6_nt.txt -outfmt "6 qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore qlen slen sstrand covhsp stitle" -num_threads 40 -max_target_seqs 1 -max_hsps 1 -dust no &

CL200085426L1C001R001_1175_2    chr9    100.000 39      0       0       11      49      118071004       118070966       6.85e-12        73.1    49
CL200085426L1C001R001_740_1     chr5    93.878  49      3       0       2       50      49602840        49602888        5.53e-13        76.8    50
CL200085426L1C001R001_2083_1    chr7    100.000 30      0       0       1       30      154413262       154413291       7.20e-07        56.5    50
CL200085426L1C001R001_2722_1    chr1    98.000  50      1       0       1       50      225676657       225676706       7.10e-17        89.8    50
CL200085426L1C001R001_2771_1    chr2    98.000  50      1       0       1       50      74907622        74907573        7.10e-17        89.8    50
CL200085426L1C001R001_2906_1    chrX    97.826  46      1       0       3       48      68388262        68388307        1.09e-14        82.4    48
CL200085426L1C001R001_2906_2    chrY    100.000 38      0       0       1       38      12035127        12035164        2.57e-11        71.3    50
CL200085426L1C001R001_2717_1    chr7    96.000  50      2       0       1       50      19529382        19529431        3.30e-15        84.2    50

Is there anyone have encountered such problem, and how can I handle that? Thanks very much!

RNA-Seq sequencing alignment • 1.7k views

ADD COMMENT • link updated 5.9 years ago by h.mon 35k • written 5.9 years ago by zhangdengwei ▴ 210

score 0 · Answer 1 · 2019-05-20

0

Entering edit mode

5.9 years ago

h.mon 35k

Looking at the blast output, the column which indicates percent identity ranges from 56% to 89%. Most short read mappers expect reads to be nearly identical to the reference genome, and won't map reads with too many mismatches.

This low percent identity also indicates these reads are not human, did you try blasting to a more comprehensive database such as NCBI NT?