BBDuk.sh removed all of my sequences!
2.4 years ago
pavelasquezv ▴ 50

Hi all,

I am mining transcriptome data from NCBI. For quality control I am using bbduk with the following command:

bbduk.sh in1=$l.fastq out=$l\_SR.fastq ref=adapters qtrim=lr \
trimq=10 overwrite=true ktrim=r  qskip=4 ways=$NSLOTS ftm=5 \
maq=10 minlen=20 trimpolya=10 trimpolyg=10 trimpolyc=10

I have tried modifying the values of several parameters, but it always removes 100% of the reads from most of the FASTQ files. If you have any suggestions for improving this command, please let me know. Many thanks!

This is an example of part of the output:

Filtered by header:             0 reads (0.00%)         0 bases (0.00%)
Low quality discards:           0 reads (0.00%)         0 bases (0.00%)
Total Removed:                  2041 reads (0.02%)      51814939 bases (8.20%)
Filtered by header:             0 reads (0.00%)         0 bases (0.00%)
Low quality discards:           0 reads (0.00%)         0 bases (0.00%)
Total Removed:                  2214 reads (0.02%)      51939725 bases (8.21%)
Filtered by header:             54054137 reads (100.00%)        2702706850 bases (100.00%)
Low quality discards:           54054137 reads (100.00%)        2702706850 bases (100.00%)
Total Removed:                  54054137 reads (100.00%)        2702706850 bases (100.00%)
Filtered by header:             21331917 reads (100.00%)        1066595850 bases (100.00%)
Low quality discards:           21331917 reads (100.00%)        1066595850 bases (100.00%)
Total Removed:                  21331917 reads (100.00%)        1066595850 bases (100.00%)
Filtered by header:             22840621 reads (100.00%)        1142031050 bases (100.00%)
Low quality discards:           22840621 reads (100.00%)        1142031050 bases (100.00%)
Total Removed:                  22840621 reads (100.00%)        1142031050 bases (100.00%)
Filtered by header:             24084611 reads (100.00%)        1204230550 bases (100.00%)
Low quality discards:           24084611 reads (100.00%)        1204230550 bases (100.00%)
Total Removed:                  24084611 reads (100.00%)        1204230550 bases (100.00%)
Filtered by header:             27595642 reads (100.00%)        1379782100 bases (100.00%)
Low quality discards:           27595642 reads (100.00%)        1379782100 bases (100.00%)
Total Removed:                  27595642 reads (100.00%)        1379782100 bases (100.00%)
Filtered by header:             7218527 reads (100.00%)         353707823 bases (100.00%)
Low quality discards:           7218527 reads (100.00%)         353707823 bases (100.00%)
Total Removed:                  7218527 reads (100.00%)         353707823 bases (100.00%)
Filtered by header:             7059269 reads (100.00%)         345904181 bases (100.00%)
Low quality discards:           7059269 reads (100.00%)         345904181 bases (100.00%)
Total Removed:                  7059269 reads (100.00%)         345904181 bases (100.00%)
Filtered by header:             7607918 reads (100.00%)         372787982 bases (100.00%)
Low quality discards:           7607918 reads (100.00%)         372787982 bases (100.00%)
Total Removed:                  7607918 reads (100.00%)         372787982 bases (100.00%)
Filtered by header:             7262556 reads (100.00%)         355865244 bases (100.00%)
Low quality discards:           7262556 reads (100.00%)         355865244 bases (100.00%)
Total Removed:                  7262556 reads (100.00%)         355865244 bases (100.00%)
Filtered by header:             7371616 reads (100.00%)         361209184 bases (100.00%)
Low quality discards:           7371616 reads (100.00%)         361209184 bases (100.00%)
Total Removed:                  7371616 reads (100.00%)         361209184 bases (100.00%)
Filtered by header:             7270371 reads (100.00%)         356248179 bases (100.00%)
Low quality discards:           7270371 reads (100.00%)         356248179 bases (100.00%)
Total Removed:                  7270371 reads (100.00%)         356248179 bases (100.00%)
Filtered by header:             7007499 reads (100.00%)         343367451 bases (100.00%)
Low quality discards:           7007499 reads (100.00%)         343367451 bases (100.00%)
Total Removed:                  7007499 reads (100.00%)         343367451 bases (100.00%)
Filtered by header:             7447287 reads (100.00%)         364917063 bases (100.00%)
Low quality discards:           7447287 reads (100.00%)         364917063 bases (100.00%)
Total Removed:                  7447287 reads (100.00%)         364917063 bases (100.00%)
Filtered by header:             7322620 reads (100.00%)         358808380 bases (100.00%)
Low quality discards:           7322620 reads (100.00%)         358808380 bases (100.00%)
Total Removed:                  7322620 reads (100.00%)         358808380 bases (100.00%)
Filtered by header:             7218751 reads (100.00%)         353718799 bases (100.00%)
Low quality discards:           7218751 reads (100.00%)         353718799 bases (100.00%)
Total Removed:                  7218751 reads (100.00%)         353718799 bases (100.00%)
Filtered by header:             16728309 reads (100.00%)        836415450 bases (100.00%)
Low quality discards:           16728309 reads (100.00%)        836415450 bases (100.00%)
Total Removed:                  16728309 reads (100.00%)        836415450 bases (100.00%)
Filtered by header:             17878193 reads (100.00%)        893909650 bases (100.00%)
Low quality discards:           17878193 reads (100.00%)        893909650 bases (100.00%)
Total Removed:                  17878193 reads (100.00%)        893909650 bases (100.00%)
Filtered by header:             20845499 reads (100.00%)        1042274950 bases (100.00%)
Low quality discards:           20845499 reads (100.00%)        1042274950 bases (100.00%)
Total Removed:                  20845499 reads (100.00%)        1042274950 bases (100.00%)
Filtered by header:             15829049 reads (100.00%)        791452450 bases (100.00%)
Low quality discards:           15829049 reads (100.00%)        791452450 bases (100.00%)
Total Removed:                  15829049 reads (100.00%)        791452450 bases (100.00%)
Filtered by header:             17183836 reads (100.00%)        859191800 bases (100.00%)
Low quality discards:           17183836 reads (100.00%)        859191800 bases (100.00%)
Total Removed:                  17183836 reads (100.00%)        859191800 bases (100.00%)
Filtered by header:             19239225 reads (100.00%)        961961250 bases (100.00%)
Low quality discards:           19239225 reads (100.00%)        961961250 bases (100.00%)
Total Removed:                  19239225 reads (100.00%)        961961250 bases (100.00%)
Filtered by header:             16969394 reads (100.00%)        848469700 bases (100.00%)
Low quality discards:           16969394 reads (100.00%)        848469700 bases (100.00%)
Total Removed:                  16969394 reads (100.00%)        848469700 bases (100.00%)
Filtered by header:             17612386 reads (100.00%)        880619300 bases (100.00%)
Low quality discards:           17612386 reads (100.00%)        880619300 bases (100.00%)
Total Removed:                  17612386 reads (100.00%)        880619300 bases (100.00%)
Filtered by header:             15393894 reads (100.00%)        769694700 bases (100.00%)
Low quality discards:           15393894 reads (100.00%)        769694700 bases (100.00%)
Total Removed:                  15393894 reads (100.00%)        769694700 bases (100.00%)
Filtered by header:             16096774 reads (100.00%)        804838700 bases (100.00%)
Low quality discards:           16096774 reads (100.00%)        804838700 bases (100.00%)
Total Removed:                  16096774 reads (100.00%)        804838700 bases (100.00%)
Filtered by header:             4819322 reads (100.00%)         221688812 bases (100.00%)
Low quality discards:           4819322 reads (100.00%)         221688812 bases (100.00%)
Total Removed:                  4819322 reads (100.00%)         221688812 bases (100.00%)
Filtered by header:             3969200 reads (100.00%)         182583200 bases (100.00%)
Low quality discards:           3969200 reads (100.00%)         182583200 bases (100.00%)
Total Removed:                  3969200 reads (100.00%)         182583200 bases (100.00%)
Filtered by header:             6211484 reads (100.00%)         304362716 bases (100.00%)
Low quality discards:           6211484 reads (100.00%)         304362716 bases (100.00%)
Total Removed:                  6211484 reads (100.00%)         304362716 bases (100.00%)
Filtered by header:             5898028 reads (100.00%)         289003372 bases (100.00%)
Low quality discards:           5898028 reads (100.00%)         289003372 bases (100.00%)
Total Removed:                  5898028 reads (100.00%)         289003372 bases (100.00%)
Filtered by header:             5870395 reads (100.00%)         287649355 bases (100.00%)
Low quality discards:           5870395 reads (100.00%)         287649355 bases (100.00%)
Total Removed:                  5870395 reads (100.00%)         287649355 bases (100.00%)
Filtered by header:             6088303 reads (100.00%)         298326847 bases (100.00%)
Low quality discards:           6088303 reads (100.00%)         298326847 bases (100.00%)
Total Removed:                  6088303 reads (100.00%)         298326847 bases (100.00%)
Filtered by header:             0 reads (0.00%)         0 bases (0.00%)
Low quality discards:           0 reads (0.00%)         0 bases (0.00%)
Total Removed:                  1332119 reads (7.12%)   55294533 bases (8.44%)
Filtered by header:             0 reads (0.00%)         0 bases (0.00%)
Low quality discards:           0 reads (0.00%)         0 bases (0.00%)
Total Removed:                  1555910 reads (8.35%)   63889275 bases (9.80%)
Filtered by header:             0 reads (0.00%)         0 bases (0.00%)
Low quality discards:           0 reads (0.00%)         0 bases (0.00%)
Total Removed:                  2015331 reads (10.86%)  82945814 bases (12.77%)
Total Removed:                  5711034 reads (100.00%)         199886190 bases (100.00%)
Filtered by header:             2854429 reads (100.00%)         99905015 bases (100.00%)
Low quality discards:           2854429 reads (100.00%)         99905015 bases (100.00%)
Total Removed:                  2854429 reads (100.00%)         99905015 bases (100.00%)
Filtered by header:             5759018 reads (100.00%)         201565630 bases (100.00%)
Low quality discards:           5759018 reads (100.00%)         201565630 bases (100.00%)
Total Removed:                  5759018 reads (100.00%)         201565630 bases (100.00%)
Filtered by header:             20446366 reads (100.00%)        2065082966 bases (100.00%)
Low quality discards:           20446366 reads (100.00%)        2065082966 bases (100.00%)
Total Removed:                  20446366 reads (100.00%)        2065082966 bases (100.00%)
Filtered by header:             0 reads (0.00%)         0 bases (0.00%)
Low quality discards:           0 reads (0.00%)         0 bases (0.00%)
Total Removed:                  3864 reads (0.02%)      44301031 bases (2.20%)
Filtered by header:             4295725 reads (100.00%)         223377700 bases (100.00%)
Low quality discards:           4295725 reads (100.00%)         223377700 bases (100.00%)
Total Removed:                  4295725 reads (100.00%)         223377700 bases (100.00%)
Filtered by header:             6081361 reads (100.00%)         316230772 bases (100.00%)
Low quality discards:           6081361 reads (100.00%)         316230772 bases (100.00%)
Total Removed:                  6081361 reads (100.00%)         316230772 bases (100.00%)
Filtered by header:             4721831 reads (100.00%)         245535212 bases (100.00%)
Low quality discards:           4721831 reads (100.00%)         245535212 bases (100.00%)
Total Removed:                  4721831 reads (100.00%)         245535212 bases (100.00%)
Filtered by header:             5467768 reads (100.00%)         191371880 bases (100.00%)
Low quality discards:           5467768 reads (100.00%)         191371880 bases (100.00%)
Total Removed:                  5467768 reads (100.00%)         191371880 bases (100.00%)
Filtered by header:             15141684 reads (100.00%)        529958940 bases (100.00%)
Low quality discards:           15141684 reads (100.00%)        529958940 bases (100.00%)
Total Removed:                  15141684 reads (100.00%)        529958940 bases (100.00%)
Filtered by header:             15345206 reads (100.00%)        537082210 bases (100.00%)
Low quality discards:           15345206 reads (100.00%)        537082210 bases (100.00%)
Total Removed:                  15345206 reads (100.00%)        537082210 bases (100.00%)
Filtered by header:             7991117 reads (100.00%)         279689095 bases (100.00%)
Low quality discards:           7991117 reads (100.00%)         279689095 bases (100.00%)
Total Removed:                  7991117 reads (100.00%)         279689095 bases (100.00%)
Filtered by header:             7977917 reads (100.00%)         279227095 bases (100.00%)
Low quality discards:           7977917 reads (100.00%)         279227095 bases (100.00%)
Total Removed:                  7977917 reads (100.00%)         279227095 bases (100.00%)
Filtered by header:             14359040 reads (100.00%)        717952000 bases (100.00%)
Low quality discards:           14359040 reads (100.00%)        717952000 bases (100.00%)
Total Removed:                  14359040 reads (100.00%)        717952000 bases (100.00%)
Filtered by header:             12269903 reads (100.00%)        613495150 bases (100.00%)
Low quality discards:           12269903 reads (100.00%)        613495150 bases (100.00%)
Total Removed:                  12269903 reads (100.00%)        613495150 bases (100.00%)
Filtered by header:             14601226 reads (100.00%)        730061300 bases (100.00%)
Low quality discards:           14601226 reads (100.00%)        730061300 bases (100.00%)
Total Removed:                  14601226 reads (100.00%)        730061300 bases (100.00%)
Filtered by header:             13348244 reads (100.00%)        667412200 bases (100.00%)
Low quality discards:           13348244 reads (100.00%)        667412200 bases (100.00%)
Total Removed:                  13348244 reads (100.00%)        667412200 bases (100.00%)
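With this many runs, a quick tally of the "Total Removed" lines makes the pattern easier to see. This is only a minimal sketch: summarize_removed is a hypothetical helper, and it assumes the log lines have exactly the layout shown above (read percentage in the fifth whitespace-separated field). The two sample lines are copied from the output above.

```shell
#!/usr/bin/env bash
# Count how many BBDuk runs in a combined log lost 100% of their reads.
# ASSUMPTION: lines look like
#   Total Removed:   <reads> reads (<pct>%)   <bases> bases (<pct>%)
summarize_removed() {
    awk '
        /^Total Removed:/ {
            runs++
            # field 5 is the read percentage, e.g. "(100.00%)"
            pct = $5
            gsub(/[()%]/, "", pct)
            if (pct + 0 >= 100) wiped++
        }
        END { printf "%d of %d runs lost all reads\n", wiped, runs }
    '
}

# Demo on two lines taken from the log in this post
summarize_removed <<'EOF'
Total Removed:                  2041 reads (0.02%)      51814939 bases (8.20%)
Total Removed:                  54054137 reads (100.00%)        2702706850 bases (100.00%)
EOF
```

For a real log, redirect the file into the function instead of the here-document.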
bbduk RNA-seq • 1.8k views
2.4 years ago
GenoMax 147k
Low quality discards:           13348244 reads (100.00%) 

Quality-based filtering is rarely needed unless your data is of very low quality. Remove the quality filter options from your command so that the data can actually be scanned and trimmed.

What is ways= in your command? That is not a valid bbduk option.

If the data is really poor quality then you have a bigger problem that can't be fixed by trimming.
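As a rough illustration of that advice, here is a sketch of an adapter-trimming-only command with the quality options and the unrecognized ways= option left out. It only prints the command so it can be inspected before running. build_bbduk_cmd and the sample name are hypothetical, and the k-mer settings (k=23 mink=11 hdist=1) are assumed example values that do not come from this thread, so adjust them to your data.

```shell
#!/usr/bin/env bash
# Build (but do not run) an adapter-trimming-only BBDuk command.
# ASSUMPTIONS: input is <sample>.fastq, output is <sample>_SR.fastq as in the
# original post; k, mink and hdist are illustrative values, not from the thread.
build_bbduk_cmd() {
    local sample="$1"
    printf 'bbduk.sh in=%s.fastq out=%s_SR.fastq ref=adapters ktrim=r k=23 mink=11 hdist=1 overwrite=true\n' \
        "$sample" "$sample"
}

# Hypothetical sample name; prints the command line for review
build_bbduk_cmd "sample1"
```

Once the printed command looks right, it can be pasted into the job script or piped to bash.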


Hi GenoMax, many thanks for your reply. The papers I found almost always report parameters along these lines:

qtrim=lr trimq=20 (or more) maq=20 (or more) minlen=20 (or more)

But I have now tried these values, and the results were better:

qtrim=lr trimq=6 maq=10 minlen=15

Regarding your question: I have fixed that error now. It was meant to set the number of processors, but I don't know how it was working.

I would be very happy if you could tell me what you think about it.

Many thanks again


ways= should be threads= if you want to use more than one core.
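A small sketch of how that could look in a job script, assuming NSLOTS is set by the scheduler as in the original command. bbduk_threads_arg is a hypothetical helper, and it falls back to a single thread when NSLOTS is unset.

```shell
#!/usr/bin/env bash
# Produce a threads= argument for bbduk.sh from the scheduler slot count.
# ASSUMPTION: NSLOTS holds the number of allocated cores; default to 1 if unset.
bbduk_threads_arg() {
    printf 'threads=%s\n' "${NSLOTS:-1}"
}

# Prints the argument that would replace ways=$NSLOTS on the command line
bbduk_threads_arg
```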


Many thanks, GenoMax! You are the best!

2.4 years ago

I agree with GenoMax that you should step back for a moment and ask yourself whether all those parameters are needed. To me, working in data mining implies that you intend to process many samples of different origins and from various sequencing centers with this one command.

While all of this may be justified to "rescue" a specific sample when alignment stats are substandard, I would not recommend pre-processing all samples this way.

With that being said, let's try to troubleshoot your current issue:

  • What does the quality distribution of your test sample look like when you plot the outputs of qhist and aqhist? (Add qhist=qhistdata.txt aqhist=aqhistdata.txt to your run, or use e.g. FastQC.) Technically, it is possible that really poor sequencing data ends up in a repository.
  • What is the purpose of the ftm=5 parameter? I have never used it before and admittedly do not understand the explanation either: "If positive, right-trim length to be equal to zero, modulo this number." Have you already tried omitting it, and was the result the same?
  • What is your rationale for including qskip=4?
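To put a number on the first question, the aqhist output can be summarized without plotting. The sketch below estimates the fraction of reads whose average quality is below a threshold, which is roughly what maq=10 would discard. frac_below_quality is a hypothetical helper, and it assumes the file is tab-separated with '#' header lines, average quality in column 1 and read count in column 2. Check this against your own aqhistdata.txt before trusting the result.

```shell
#!/usr/bin/env bash
# Fraction of reads with average quality below a threshold, from an
# aqhist-style table.
# ASSUMPTION: tab-separated; lines starting with '#' are headers;
# column 1 = average quality, column 2 = number of reads.
frac_below_quality() {
    local hist_file="$1" threshold="$2"
    awk -F'\t' -v t="$threshold" '
        /^#/ { next }                          # skip header lines
        { total += $2; if ($1 < t) low += $2 } # accumulate read counts
        END {
            if (total > 0) printf "%.4f\n", low / total
            else print "NA"
        }
    ' "$hist_file"
}

# Only runs if the histogram from a previous bbduk run is present
if [ -f aqhistdata.txt ]; then
    frac_below_quality aqhistdata.txt 10
fi
```

A value close to 1 would mean the data itself is poor. A value close to 0 would suggest the 100% loss has another cause.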

Hi Matthias Zepper, many thanks for your reply. Yes, I am working with samples of different origins and from various sequencing centers. The data come from the NCBI SRA database. I do not yet know what the quality distributions in qhistdata.txt and aqhistdata.txt look like, but I can check now. I don't understand the ftm parameter either; I included it because I saw it in published articles. Maybe it's better to remove it. I set qskip=4 to improve the speed. Many thanks again!
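For readers puzzled by ftm: as far as the BBDuk help text describes it, ftm (forcetrimmod) right-trims each read so that its length becomes a multiple of the given number. A minimal arithmetic sketch under that assumption follows. ftm_length is a hypothetical helper, not part of BBTools.

```shell
#!/usr/bin/env bash
# Length a read would have after modulo trimming.
# ASSUMPTION: ftm=N right-trims so that (length mod N) becomes zero.
ftm_length() {
    local len="$1" mod="$2"
    echo $(( len - len % mod ))
}

ftm_length 151 5   # prints 150: one extra base is trimmed
ftm_length 50 5    # prints 50: already a multiple of 5, nothing trimmed
```

Under this reading, ftm=5 removes at most four bases per read, so on its own it would not explain a 100% loss.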


Well, I do not know a single author of bioinformatics software who deliberately puts sleep or wait commands into the code just to make the software slower by default.

So if qskip=4 results in faster run times, that is a side effect of modifying the default algorithm and changing what the software does. If you don't understand what a parameter changes and have not established that there are good reasons to deviate from the default for your specific data set, then please don't set it.

Just because it has been used in published articles doesn't mean a thing. Sequencing technology and, for example, base-call quality have evolved over time, so settings from earlier times might no longer be best practice. Also consider that reviewers usually do not care too much about the methods section of an article, and that the main authors are usually the ones with the wet-lab experience, who may during manuscript preparation simply have copied and pasted the command from an e-mail that their collaborating bioinformatician sent them years ago.


Hi Matthias Zepper, many thanks for your reply. You're right. I'm going to change my code and see if just removing adapters gives me better results. Thank you very much again!
