We are a small group of undergrads, mostly sophomore, from a small HBCU, and learning bioinformatics and genomics that is not at all part of our regular syllabus, by trying to teach one another :)
Right now, we are focused on how to use BBmap
and other such tools to process Illumina raw reads - but we are new to all of this, and experiencing a steep learning curve.
So could someone please educate us on how to interpret the **BBDUK.sh**
stats options that create the various histogram files that are generated by invoking the following flags: qhist
, qchist
, aqhist
, bqhist
, and lhist
?
From several different posts here on BioStars and a few on SeqAnswers, we decided on this syntax for BBDUK based adapter and quality trimming, to keep it gentle (not too stringent) but also impose min length to prevent very crazy multi-mapping during read alignment to the reference genome (we're not there yet):
bbduk.sh -Xmx34g in=TrialSRR.fq.gz \
out=TrialSRR_AdQualTrm_minlen20.fq.gz \
ktrim=r k=23 mink=11 minlen=20 ref=adapters ordered threads=1 \
ftr=99 qtrim=r trimq=15 \
qhist=TrialSRR_minlen20.qhist \
qchist=TrialSRR_minlen20.qchist \
aqhist=TrialSRR_minlen20.aqhist \
bqhist=TrialSRR_minlen20.bqhist \
lhist=TrialSRR_minlen20.lhist;
When we generated these files, we see their contents as shown below. From these contents, think we understand a few of these numbers and column headings, but not sure that we do. Some things we gathered from this forum and the internet in general, as well as our educated guesses, are listed below. We request forum experts to add, comment, correct these inferences, and answer our pending questions. Thank you!
- In our data, the maximum Illumina read length is 100 + 1 where the final position is poorly calibrated and ideally trimmed out. Base# here varies from 1-101 (human coding) or 0-100 (machine coding starting with 0 index)?
- In our data, 40 is the highest PHRED score we see in our input Illumina FASTQ file, which is also the empirical maximum for the HiSeq 2500 platform used to create these data, or is the maximum 50 ?
- In QHIST file, column total under count1 is # of nt for each PHRED score between 0 and 41 - totaling, 1614284414 and total fraction1 is 1.00003.
- 15983014 reads * (100+1) nt / read = 1614284414 total nt across all reads in this library, so that tallies. It is not clear where this 1 extra nt comes from during sequencing, even from reading from other Biostars posts. Please help. We know this 1 nt must be removed, since it almost always poor quality, right?
- In the BQHIST file, before any trimming of adapter or by quality, in the input file, where reads are uniformly 100nt long, these gives the stats for each of those 100 positions, right? As expected, PHRED score improves from 5'end downstream and slightly degrades at the 3' end, right?
- In QHIST file, no idea what the headers mean - Read1_linear or Read1_log ?
- In AQHIST file, we think the count1 is total number of input raw reads = 15983014 here, and we are guessing the number of reads with average PHRED score across it's entire length, and their corresponding fraction of the total is the third column, but not sure at all.
We guess the BQHIST headers mean the following: - but what are LW_1 and RW_1?
BaseNum=nt position in read
count_1=total # reads analyzed
min_1=minimum PHRED score at this position
max_1=maximum PHRED score at this position
mean_1=average PHRED score at this position
Q1_1=1st quartile of PHRED score at this position
med_1=Median or 2nd quartile of PHRED score at this position
Q3_1=3rd quartile of PHRED score at this position
**LW_1 = ??? not sure!** Help!
**RW_1 = ??? not sure!** Help!
Once again, thanks in advance!
PS. Apologies for any weirdness in our post's formatting, it's our first time here, and trying to use this markdown system....
HIST file listing
-rw-rw-r-- 1 lotte lotte 27 Nov 16 21:00 TrialSRR_minlen20.lhist
-rw-rw-r-- 1 lotte lotte 820 Nov 16 21:00 TrialSRR_minlen20.qchist
-rw-rw-r-- 1 lotte lotte 3.8K Nov 16 21:00 TrialSRR_minlen20.bqhist
-rw-rw-r-- 1 lotte lotte 1.7K Nov 16 21:00 TrialSRR_minlen20.qhist
-rw-rw-r-- 1 lotte lotte 740 Nov 16 21:00 TrialSRR_minlen20.aqhis
LHIST file contents
lotte@Uni-HPC:~Learning$ wc -l TrialSRR_minlen20.lhist
2 TrialSRR_minlen20.lhist
lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.lhist
#Length Count
101 15983014
QCHIST file contents
lotte@Uni-HPC:~Learning$ wc -l TrialSRR_minlen20.qchist
43 TrialSRR_minlen20.qchist
lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.qchist
#Quality count1 fraction1
0 277949 0.00017
1 0 0.00000
2 57657078 0.03572
3 0 0.00000
4 0 0.00000
5 369144 0.00023
6 1014149 0.00063
7 5704849 0.00353
8 5889111 0.00365
9 3397890 0.00210
10 4329169 0.00268
11 1607492 0.00100
12 1061212 0.00066
13 2845130 0.00176
14 1151754 0.00071
15 2854771 0.00177
16 4366874 0.00271
17 3243345 0.00201
18 8082441 0.00501
19 5925749 0.00367
20 6627712 0.00411
21 3527369 0.00219
22 6282643 0.00389
23 8649813 0.00536
24 11327269 0.00702
25 17252728 0.01069
26 23676889 0.01467
27 19632016 0.01216
28 17542010 0.01087
29 31795702 0.01970
30 44642598 0.02765
31 67690229 0.04193
32 53267289 0.03300
33 66360275 0.04111
34 129894448 0.08047
35 164453500 0.10187
36 90828097 0.05627
37 129516469 0.08023
38 114018646 0.07063
39 140521421 0.08705
40 173852845 0.10770
41 183146339 0.11345
BQHIST file contents
lotte@Uni-HPC:~Learning$ wc -l TrialSRR_minlen20.bqhist
102 TrialSRR_minlen20.bqhist
lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.bqhist
#BaseNum count_1 min_1 max_1 mean_1 Q1_1 med_1 Q3_1 LW_1 RW_1
0 15983014 0 34 31.06 31 31 33 23 34
1 15983014 2 34 30.70 31 31 34 16 34
2 15983014 2 34 28.71 27 31 34 10 34
3 15983014 2 37 33.42 33 35 37 10 37
4 15983014 2 37 34.49 35 35 37 19 37
5 15983014 2 37 34.86 35 35 37 23 37
6 15983014 2 37 35.08 35 36 37 25 37
7 15983014 2 37 35.12 35 37 37 25 37
8 15983014 2 39 36.66 35 39 39 25 39
9 15983014 2 39 36.65 35 39 39 25 39
10 15983014 2 39 36.66 35 39 39 25 39
11 15983014 2 39 36.39 35 38 39 19 39
12 15983014 2 39 36.47 35 38 39 23 39
13 15983014 2 41 37.58 36 39 41 24 41
14 15983014 2 41 37.66 36 39 41 25 41
15 15983014 2 41 37.57 36 39 41 24 41
16 15983014 2 41 37.28 36 39 41 17 41
17 15983014 2 41 37.45 36 39 41 23 41
18 15983014 2 41 37.60 36 39 41 24 41
19 15983014 0 41 37.58 36 39 41 24 41
20 15983014 0 41 37.58 36 39 41 24 41
21 15983014 0 41 37.52 36 39 41 24 41
22 15983014 0 41 37.46 36 39 41 23 41
23 15983014 0 41 37.39 36 39 41 22 41
24 15983014 0 41 37.34 36 39 41 21 41
25 15983014 0 41 37.09 36 39 40 18 41
26 15983014 0 41 37.05 36 39 40 18 41
27 15983014 0 41 36.45 36 39 40 10 41
28 15983014 0 41 36.70 36 39 40 16 41
29 15983014 0 41 36.72 36 39 40 16 41
30 15983014 0 41 36.55 36 39 40 15 41
31 15983014 0 41 36.42 35 38 40 15 41
32 15983014 0 41 36.46 35 38 40 15 41
33 15983014 0 41 36.42 35 38 40 15 41
34 15983014 0 41 36.20 35 38 40 9 41
35 15983014 0 41 35.95 35 38 40 9 41
36 15983014 0 41 36.29 35 38 40 15 41
37 15983014 0 41 36.24 35 38 40 9 41
38 15983014 0 41 36.14 35 38 40 9 41
39 15983014 0 41 35.97 35 38 40 9 41
40 15983014 0 41 36.12 35 38 40 9 41
41 15983014 0 41 36.09 35 38 40 9 41
42 15983014 0 41 36.03 35 38 40 9 41
43 15983014 0 41 35.82 35 38 40 9 41
44 15983014 2 41 35.93 35 38 40 9 41
45 15983014 2 41 36.10 35 38 40 9 41
46 15983014 0 41 35.82 35 38 40 8 41
47 15983014 0 41 35.87 35 38 40 9 41
48 15983014 0 41 35.97 35 38 40 9 41
49 15983014 0 41 35.91 35 38 40 8 41
50 15983014 0 41 35.47 34 38 40 7 41
51 15983014 2 41 35.55 34 38 40 7 41
52 15983014 2 41 35.56 34 38 40 7 41
53 15983014 2 41 35.51 34 38 40 7 41
54 15983014 2 41 35.18 34 38 40 2 41
55 15983014 2 41 35.16 34 38 40 2 41
56 15983014 2 41 34.94 33 38 40 2 41
57 15983014 2 41 34.87 33 38 40 2 41
58 15983014 2 41 34.83 33 38 40 2 41
59 15983014 2 41 34.71 33 38 40 2 41
60 15983014 2 41 34.53 33 37 40 2 41
61 15983014 2 41 34.36 33 37 40 2 41
62 15983014 2 41 34.15 32 37 40 2 41
63 15983014 2 41 34.03 32 37 40 2 41
64 15983014 2 41 33.92 32 37 39 2 41
65 15983014 2 41 33.65 32 36 39 2 41
66 15983014 2 41 33.18 31 36 39 2 41
67 15983014 2 41 33.09 31 36 39 2 41
68 15983014 2 41 32.85 31 36 39 2 41
69 15983014 2 41 32.72 31 35 39 2 41
70 15983014 2 41 32.42 31 35 38 2 41
71 15983014 0 41 32.24 31 35 38 2 41
72 15983014 0 41 32.11 31 35 38 2 41
73 15983014 0 41 31.94 31 35 37 2 41
74 15983014 0 41 31.71 30 35 37 2 41
75 15983014 0 41 30.53 29 34 36 2 40
76 15983014 0 41 30.96 30 34 36 2 40
77 15983014 0 41 31.00 30 34 36 2 40
78 15983014 0 41 31.11 30 34 36 2 40
79 15983014 0 41 30.98 30 34 36 2 40
80 15983014 0 41 30.80 30 34 36 2 39
81 15983014 0 41 30.67 30 34 36 2 39
82 15983014 0 41 30.53 30 34 35 2 39
83 15983014 0 41 30.31 30 34 35 2 39
84 15983014 0 41 29.73 29 34 35 2 39
85 15983014 0 41 29.65 29 34 35 2 38
86 15983014 0 41 29.50 29 34 35 2 37
87 15983014 0 41 29.18 29 34 35 2 37
88 15983014 2 41 29.06 29 34 35 2 37
89 15983014 0 41 28.79 29 34 35 2 37
90 15983014 2 41 28.79 29 34 35 2 37
91 15983014 0 41 28.68 29 33 35 2 36
92 15983014 0 41 28.49 29 33 35 2 36
93 15983014 0 41 28.34 29 33 35 2 36
94 15983014 2 41 28.11 29 33 35 2 36
95 15983014 0 41 27.54 27 33 35 2 36
96 15983014 2 40 27.41 27 33 35 2 36
97 15983014 2 40 27.13 26 33 35 2 36
98 15983014 2 40 26.88 26 33 35 2 36
99 15983014 2 39 26.50 25 33 35 2 36
100 15983014 0 39 24.30 19 30 34 2 35
QHIST file contents
lotte@Uni-HPC:~Learning$ wc -l TrialSRR_minlen20.qhist
102 TrialSRR_minlen20.qhist
lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.qhist
#BaseNum Read1_linear Read1_log
1 31.063 26.885
2 30.698 24.777
3 28.714 20.556
4 33.421 24.922
5 34.491 29.127
6 34.864 29.350
7 35.080 29.805
8 35.118 30.052
9 36.658 30.166
10 36.652 30.040
11 36.662 30.395
12 36.389 28.285
13 36.465 29.323
14 37.579 29.032
15 37.663 30.035
16 37.574 29.065
17 37.282 27.307
18 37.446 29.248
19 37.598 29.421
20 37.576 29.258
21 37.578 29.198
22 37.519 28.767
23 37.462 28.431
24 37.387 27.753
25 37.338 27.351
26 37.086 25.540
27 37.050 25.438
28 36.447 23.315
29 36.702 24.383
30 36.717 24.062
31 36.555 23.301
32 36.422 22.903
33 36.459 22.835
34 36.423 22.594
35 36.197 21.844
36 35.950 21.385
37 36.290 21.849
38 36.243 21.372
39 36.145 20.975
40 35.969 20.686
41 36.124 20.852
42 36.087 20.575
43 36.031 20.427
44 35.823 19.860
45 35.934 20.178
46 36.099 20.175
47 35.817 18.978
48 35.872 19.512
49 35.970 19.451
50 35.906 19.198
51 35.474 18.051
52 35.548 18.662
53 35.564 18.477
54 35.506 18.265
55 35.178 17.768
56 35.164 17.730
57 34.940 17.352
58 34.868 17.281
59 34.826 17.139
60 34.712 16.925
61 34.534 16.697
62 34.362 16.509
63 34.147 16.268
64 34.035 16.146
65 33.921 15.996
66 33.645 15.748
67 33.178 15.393
68 33.091 15.371
69 32.846 15.159
70 32.724 15.043
71 32.425 14.815
72 32.241 14.697
73 32.106 14.588
74 31.937 14.460
75 31.713 14.296
76 30.530 14.150
77 30.962 14.091
78 30.998 13.886
79 31.111 13.882
80 30.976 13.727
81 30.800 13.572
82 30.667 13.455
83 30.529 13.311
84 30.307 13.133
85 29.732 12.769
86 29.653 12.709
87 29.497 12.535
88 29.178 12.281
89 29.063 12.139
90 28.794 11.890
91 28.790 11.777
92 28.681 11.583
93 28.490 11.350
94 28.342 11.122
95 28.105 10.832
96 27.542 10.419
97 27.407 10.170
98 27.129 9.831
99 26.876 9.495
100 26.500 9.128
101 24.302 8.607
AQHIST file contents
lotte@Uni-HPC:~Learning$ wc -l TrialSRR_minlen20.aqhist
42 TrialSRR_minlen20.aqhist
lotte@Uni-HPC:~Learning$ cat TrialSRR_minlen20.aqhist
#Quality count1 fraction1
0 0 0.00000
1 0 0.00000
2 1003 0.00006
3 40192 0.00251
4 128895 0.00806
5 199688 0.01249
6 226919 0.01420
7 213066 0.01333
8 174075 0.01089
9 206481 0.01292
10 216097 0.01352
11 226341 0.01416
12 228745 0.01431
13 249807 0.01563
14 258218 0.01616
15 227851 0.01426
16 229582 0.01436
17 253982 0.01589
18 238962 0.01495
19 307747 0.01925
20 300699 0.01881
21 387786 0.02426
22 478053 0.02991
23 416609 0.02607
24 464175 0.02904
25 537010 0.03360
26 619228 0.03874
27 531987 0.03328
28 577129 0.03611
29 637311 0.03987
30 330176 0.02066
31 416774 0.02608
32 542824 0.03396
33 699675 0.04378
34 771215 0.04825
35 881653 0.05516
36 1175349 0.07354
37 1504781 0.09415
38 974260 0.06096
39 108547 0.00679
40 122 0.00001
Thank you so very much for easy to understand answers not loaded with excessive jargon or technically complexities.
I think the only question yet to be answered is: what are LW_1 and RW_1? Hopefully someone answers that as well.
We do have one follow-up question - I think you meant to explain that PHRED score drop at position 1 is expected, but not so much at position 2 and 3, right? If that is the case, then is this symptomatic of some sort of sequencing or processing problem that can be fixed at the processing step, before mapping these RNA-seq reads to a reference genome? Please advice.
BTW, forum members here so awesome, friendly and helpful :) Thanks again to forum members and maintainers...
LW_1 and RW_1 are the boxplot whisker values, set at the 2nd and 98th percentiles.
I wouldn't bother too much at the initial drop in quality, as it is short (positions 2 and 3) and not very strong (average quality is still 20). You could have asked for the base composition histogram (bhist) to check if there is an acumulation of
N
s at these positions.Thank you for your reply and advice.
I suppose LW and RW might be acronyms for left and right whiskers respectively.
We will inquire for N % in the histogram results. Thanks for that suggestion.
Sorry for the late response, we all had papers, exams etc :)