I am mapping Illumina paired-end total RNA-seq reads to a human reference, with the aim of keeping and analysing the unmapped reads for microbial content. This is the command I am using:
bbmap.sh in1=R1.fastq.gz in2=R2.fastq.gz outm=mapped.fastq.gz outu=unmapped.fastq.gz maxindel=200k ambig=random intronlen=20 xstag=us
The output I get from the mapping is this:
Read 1 data:        pct reads   num reads   pct bases   num bases
mapped:               1.4446%     4616012     1.4214%   444831621
unambiguous:          0.5291%     1690797     0.5257%   164530054
ambiguous:            0.9155%     2925215     0.8957%   280301567
low-Q discards:       0.1385%      442659     0.0495%    15493065
perfect best site:    0.2236%      714613     0.2271%    71086566
semiperfect site:     0.2257%      721039     0.2292%    71728958
rescued:              0.0340%      108781
Match Rate:                NA          NA    45.0251%   437905504
Error Rate:          84.5007%     3900565    54.9729%   534655405
Sub Rate:            83.8672%     3871322     0.6738%     6553495
Del Rate:             3.8987%      179964    54.2627%   527748664
Ins Rate:             1.6776%       77440     0.0363%      353246
N Rate:               0.2227%       10282     0.0020%       19376
Splice Rate:          2.6624%      122895   (splices at least 20 bp)
Read 2 data:        pct reads   num reads   pct bases   num bases
mapped:               1.4562%     4653230     1.4351%   449664382
unambiguous:          0.5736%     1832980     0.5704%   178723715
ambiguous:            0.8826%     2820250     0.8647%   270940667
low-Q discards:       0.1382%      441542     0.0498%    15597422
perfect best site:    0.2657%      849081     0.2699%    84561207
semiperfect site:     0.2661%      850225     0.2702%    84672337
rescued:              0.0534%      170740
Match Rate:                NA          NA    49.6060%   442644534
Error Rate:          81.7453%     3803799    50.3914%   449653141
Sub Rate:            81.0799%     3772836     0.7479%     6674030
Del Rate:             3.6818%      171322    49.6073%   442656394
Ins Rate:             1.5347%       71415     0.0362%      322717
N Rate:               0.1796%        8358     0.0026%       23101
Splice Rate:          2.5801%      120056   (splices at least 20 bp)
Total time: 35599.792 seconds.
The mapping rate is low, which I am happy with, since I wanted as little human contamination as possible. But should I be concerned about the Match Rate for read 1 (NA in the read columns) and the high Error Rate for both read 1 and read 2 (>80%)? What exactly does the error rate mean?
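For what it's worth, the Read 1 numbers above can be checked with some quick arithmetic. This is only an inference from the figures in this post, not something taken from the BBMap documentation: the read-level rates seem to be fractions of the mapped reads (not of all reads), and the base-level rates seem to be fractions of match + sub + del + ins + N bases. The variable names and the `pct` helper below are made up for illustration.

```shell
# Values copied from the "Read 1 data" block in this post.
mapped_reads=4616012
error_reads=3900565
del_reads=179964
match_bases=437905504
sub_bases=6553495
del_bases=527748664
ins_bases=353246
n_bases=19376

# Hypothetical helper: print 100 * numerator / denominator with four decimals.
pct() {
    awk -v n="$1" -v d="$2" 'BEGIN { printf "%.4f\n", 100 * n / d }'
}

# Assumption: "Error Rate" (pct reads) = mapped reads with at least one error / mapped reads.
error_read_pct=$(pct "$error_reads" "$mapped_reads")

# Assumption: base-level rates use every aligned column as the denominator.
aligned_bases=$((match_bases + sub_bases + del_bases + ins_bases + n_bases))
match_base_pct=$(pct "$match_bases" "$aligned_bases")
del_base_pct=$(pct "$del_bases" "$aligned_bases")

# Average number of deleted bases per read that contains a deletion.
del_per_read=$(awk -v b="$del_bases" -v r="$del_reads" 'BEGIN { printf "%.0f\n", b / r }')

echo "Error Rate, pct reads: ${error_read_pct}% (reported 84.5007%)"
echo "Match Rate, pct bases: ${match_base_pct}% (reported 45.0251%)"
echo "Del Rate,   pct bases: ${del_base_pct}% (reported 54.2627%)"
echo "Deleted bases per deletion-containing read: ${del_per_read}"
```

If that reading is right, the >80% Error Rate says that most of the 1.4% of reads that did map have at least one mismatch or indel. It does not say anything about the quality of the whole library. Match Rate is NA in the read columns because it only seems to be reported per base. The base-level error is almost entirely deletions: under 4% of mapped reads carry one, but each spans thousands of bases on average, which looks more like long gapped or spliced alignments permitted by maxindel=200k than like sequencing errors.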
Thanks Geno. On that note, this is the pairing data:
Is the percent pairs a concern?
No, because your number closely matches your alignment % for Read 1 and Read 2 (it will generally be a few points lower).
Ahh ok, so is the pairing data just for the reads that map to the reference?
Mated pair data indicates that the reads mapped to the reference at an expected distance.