Question

Help understanding BBMAP output

0

2.8 years ago

m.radz ▴ 10

I am mapping Illumina paired-end total RNA-seq reads to a human reference, with the aim of keeping the unmapped reads and analysing them for microbial content. This is the command I am using:

bbmap.sh in1=R1.fastq.gz in2=R2.fastq.gz outm=mapped.fastq.gz outu=unmapped.fastq.gz maxindel=200k ambig=random intronlen=20 xstag=us

The output I get from the mapping is this:

Read 1 data:            pct reads       num reads       pct bases          num bases

mapped:                   1.4446%         4616012         1.4214%          444831621
unambiguous:              0.5291%         1690797         0.5257%          164530054
ambiguous:                0.9155%         2925215         0.8957%          280301567
low-Q discards:           0.1385%          442659         0.0495%           15493065

perfect best site:        0.2236%          714613         0.2271%           71086566
semiperfect site:         0.2257%          721039         0.2292%           71728958
rescued:                  0.0340%          108781

Match Rate:                   NA               NA        45.0251%          437905504
Error Rate:              84.5007%         3900565        54.9729%          534655405
Sub Rate:                83.8672%         3871322         0.6738%            6553495
Del Rate:                 3.8987%          179964        54.2627%          527748664
Ins Rate:                 1.6776%           77440         0.0363%             353246
N Rate:                   0.2227%           10282         0.0020%              19376
Splice Rate:              2.6624%          122895       (splices at least 20 bp)


Read 2 data:            pct reads       num reads       pct bases          num bases

mapped:                   1.4562%         4653230         1.4351%          449664382
unambiguous:              0.5736%         1832980         0.5704%          178723715
ambiguous:                0.8826%         2820250         0.8647%          270940667
low-Q discards:           0.1382%          441542         0.0498%           15597422

perfect best site:        0.2657%          849081         0.2699%           84561207
semiperfect site:         0.2661%          850225         0.2702%           84672337
rescued:                  0.0534%          170740

Match Rate:                   NA               NA        49.6060%          442644534
Error Rate:              81.7453%         3803799        50.3914%          449653141
Sub Rate:                81.0799%         3772836         0.7479%            6674030
Del Rate:                 3.6818%          171322        49.6073%          442656394
Ins Rate:                 1.5347%           71415         0.0362%             322717
N Rate:                   0.1796%            8358         0.0026%              23101
Splice Rate:              2.5801%          120056       (splices at least 20 bp)

Total time:             35599.792 seconds.

The mapping rate is low, which I am happy with, since I wanted as little human contamination as possible. But should I be concerned about the Match Rate for read 1 (NA) and the high Error Rate for both read 1 and read 2 (>80%)? What exactly does the error rate mean?

rna mapping bbmap • 1.9k views

ADD COMMENT • link updated 2.8 years ago by GenoMax 151k • written 2.8 years ago by m.radz ▴ 10

score 1 · Answer 1 · 2022-08-02

1

2.8 years ago

GenoMax 151k

There is no detailed explanation of how Brian (the author of BBMap) calculates these myriad stats. This old post from Brian at SEQanswers is about the closest you are going to get: https://www.seqanswers.com/forum/bioinformatics/bioinformatics-aa/54070-error-rate-in-bbmap?p=278378#post278378
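As a rough cross-check, the counts posted in the question can be compared against the reported percentages. The sketch below uses plain Python with illustrative variable names. The choice of denominators is inferred from the numbers themselves, not taken from BBMap documentation, so treat it as an assumption. On that reading, the "pct reads" error columns are fractions of mapped reads (not of all reads), so an 84.5% Error Rate would mean that roughly 84.5% of the mapped reads contain at least one mismatch or indel. The "pct bases" columns appear to be fractions of match + sub + del + ins + N bases, and the base-level error rate is dominated by deletion bases. That may reflect introns being counted as long deletions when spliced RNA reads are aligned with maxindel=200k, but that interpretation is also an assumption.

```python
# Read 1 counts copied from the BBMap output in the question.
mapped_reads = 4_616_012
error_reads = 3_900_565
match_bases = 437_905_504
sub_bases = 6_553_495
del_bases = 527_748_664
ins_bases = 353_246
n_bases = 19_376


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Assumed denominator for the "pct reads" column: mapped reads only.
error_read_pct = pct(error_reads, mapped_reads)

# In this output the reported error bases equal sub + del + ins.
error_bases = sub_bases + del_bases + ins_bases

# Assumed denominator for the "pct bases" column: every aligned column.
aligned_bases = match_bases + error_bases + n_bases
match_base_pct = pct(match_bases, aligned_bases)
del_base_pct = pct(del_bases, aligned_bases)

print(f"error reads / mapped reads: {error_read_pct:.4f}%")
print(f"match bases / aligned bases: {match_base_pct:.4f}%")
print(f"deletion bases / aligned bases: {del_base_pct:.4f}%")
print(f"deletion share of error bases: {pct(del_bases, error_bases):.2f}%")
```

The printed values can be compared by eye with the Error Rate, Match Rate and Del Rate rows for Read 1. The same arithmetic works for Read 2 after swapping in its counts.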

I personally only look at the mapped percentages of read 1 and read 2 along with the % of properly paired reads.

I am going to say that your results look good. You can take a few reads from the unmapped pool and BLAST them at NCBI to see what you get. They should be non-human hits.
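One way to pull a handful of reads for that spot check is sketched below. It is a minimal example that assumes the unmapped file is a standard gzipped FASTQ with four lines per record. The function name `fastq_head_to_fasta` is hypothetical, and it simply takes the first records in the file, not a random sample. The FASTA text it returns can be pasted into the NCBI BLAST web form.

```python
import gzip
from itertools import islice


def fastq_head_to_fasta(fastq_gz_path, n_reads=10):
    """Return the first n_reads records of a gzipped FASTQ as FASTA text."""
    fasta_lines = []
    with gzip.open(fastq_gz_path, "rt") as handle:
        while len(fasta_lines) < 2 * n_reads:
            # One FASTQ record: header, sequence, '+' separator, qualities.
            record = list(islice(handle, 4))
            if len(record) < 4:
                break  # end of file, or a truncated final record
            header = record[0].rstrip("\n")
            sequence = record[1].rstrip("\n")
            fasta_lines.append(">" + header[1:])  # drop the leading '@'
            fasta_lines.append(sequence)
    return "\n".join(fasta_lines) + ("\n" if fasta_lines else "")
```

With the file name from the command in the question, a call such as `print(fastq_head_to_fasta("unmapped.fastq.gz", n_reads=5))` would print five reads in FASTA format.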

ADD COMMENT • link 2.8 years ago by GenoMax 151k

0

Thanks Geno. On that note, this is the pairing data:

Pairing data:           pct pairs       num pairs       pct bases          num bases

mated pairs:              1.2423%         3969692         1.2196%          763829132
bad pairs:                0.0347%          110810         0.0338%           21194634
insert size avg:          288.05

Is the percent pairs a concern?

ADD REPLY • link 2.8 years ago by m.radz ▴ 10

0

No, because your mated-pair percentage closely matches the alignment percentages for Read 1 and Read 2 (it will generally be a few points lower).
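To put numbers on that comparison, the counts posted in this thread can be divided directly. This is a small sketch with illustrative variable names. It assumes that each mated pair contributes exactly one mapped Read 1 and one mapped Read 2, which is not confirmed by BBMap documentation.

```python
# Counts copied from the BBMap output posted in this thread.
mapped_read1 = 4_616_012
mapped_read2 = 4_653_230
mated_pairs = 3_969_692
bad_pairs = 110_810

# Share of mapped reads that belong to a pair reported as properly mated.
mated_share_read1 = 100.0 * mated_pairs / mapped_read1
mated_share_read2 = 100.0 * mated_pairs / mapped_read2

# Share of mapped Read 1 reads that belong to a pair reported as "bad".
bad_share_read1 = 100.0 * bad_pairs / mapped_read1

print(f"mated pairs / mapped Read 1: {mated_share_read1:.1f}%")
print(f"mated pairs / mapped Read 2: {mated_share_read2:.1f}%")
print(f"bad pairs / mapped Read 1: {bad_share_read1:.1f}%")
```

Under that assumption, most mapped reads sit in properly mated pairs, which is consistent with the mated-pair percentage being only slightly below the per-read mapping percentages.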

ADD REPLY • link 2.8 years ago by GenoMax 151k

0

Ahh ok, so is the pairing data just for the reads that map to the reference?

ADD REPLY • link 2.8 years ago by m.radz ▴ 10

1

Mated pair data indicates that both reads of a pair mapped to the reference at the expected distance from each other.

ADD REPLY • link 2.8 years ago by GenoMax 151k