BBDuk.sh and BBDuk2.sh barcodefilter does not filter the reads
1
0
Entering edit mode
6.0 years ago
chilifan ▴ 120

I have a sample from illumina I want to filter so that only the reads containing the barcode TAAGGCGA are kept in an output file. I know that this barcode exists and that I have contamination from other barcodes through this bit of code:

#Barode for COL02
$ grep TAAGGCGA R1.fastq | wc -w
462670619

Barcode for COL04
$ grep CGTACTAG R1.fastq | wc -w
23463

 #Barcode for COL24
$ grep TCGACGTC R1.fastq | wc -w
37567

However, when I use BBDuk2.sh to filter on the barcode from COL02, no reads are removed and the output files are exactly the same size as the input files. What am I doing wrong?

Here is my input code and the terminal output, I have also tried with BBDuk.sh, I knw BBDuk2.sh is for piping.

   dell@dell-Precision-7920-Tower ~
$ /home/dell/proj/tools/bbmap/bbduk2.sh 
in=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R1.fastq
in2=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R2.fastq
out=/home/dell/proj/tools/bbmap/output/c1_R1.fastq 
out2=/home/dell/proj/tools/bbmap/output/c1_R2.fastq 
barcodes=TAAGGCGA barcodefilter=t 
-Xmx68g overwrite=TRUE stats=/home/dell/proj/tools/bbmap/output/c1_stats prealloc=t

java -Djava.library.path=/home/dell/proj/tools/bbmap/jni/ -ea -Xmx68g -Xms68g -cp /home/dell/proj/tools/bbmap/current/ jgi.BBDuk2
 in=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R1.fastq
 in2=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R2.fastq 
out=/home/dell/proj/tools/bbmap/output/c1_R1.fastq 
out2=/home/dell/proj/tools/bbmap/output/c1_R2.fastq barcodes=TAAGGCGA 
barcodefilter=t -Xmx68g overwrite=TRUE stats=/home/dell/proj/tools/bbmap/output/c1_stats prealloc=t

Executing jgi.BBDuk2 [in=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R1.fastq,
 in2=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R2.fastq, 
out=/home/dell/proj/tools/bbmap/output/c1_R1.fastq, out2=/home/dell/proj/tools/bbmap/output/c1_R2.fastq,
 barcodes=TAAGGCGA, barcodefilter=t, -Xmx68g, overwrite=TRUE, 
stats=/home/dell/proj/tools/bbmap/output/c1_stats, prealloc=t]
    Version 38.32

    Note - if this program runs out of memory, please disable the prealloc flag.
    Initial size set to 676157570

NOTE: No reference files specified, no trimming mode, no min avg quality, no histograms - read sequences will not be changed.
Initial:
Memory: max=73014m, total=73014m, free=72930m, used=84m

Input is being processed as paired
Started output streams: 0.012 seconds.
Processing time:        2244.830 seconds.

Input:                      440181546 reads         66027231900 bases.
Low quality discards:       0 reads (0.00%)     0 bases (0.00%)
Total Removed:              0 reads (0.00%)     0 bases (0.00%)
Result:                     440181546 reads (100.00%)   66027231900 bases (100.00%)

Time:                           2244.955 seconds.
Reads Processed:        440m    196.08k reads/sec
Bases Processed:      66027m    29.41m bases/sec

The head of my fastq file sequences looks like this:

enter image description here

BBMap BBDuk barcode illumina index fastq • 3.2k views
ADD COMMENT
3
Entering edit mode

Could it be that you are counting the existence of the barcode sequence in the read-sequence?

Try to grep "1:N:0:CGTACTAG" , or use awk to get only the first of the four lines (awk 'NR%4==1 {print}') .

Cheers,

Michael

ADD REPLY
0
Entering edit mode

hm, well, yes with grep I want to count the existance of the barcode sequence. It's just my way of checking that the barcode actually exists and that I should filter on barcode for col02, rather than filtering on any of the rest of the barcodes for other columns. I base that on the number of reads with that barcode.

But BBDuk should be able to actually extract the reads with that specific barcode to a new file. (I guess I could do it myself in bash also, but I assumed that BBDuk would be faster at it.) :) :)

ADD REPLY
2
Entering edit mode

Your sample has already been filtered (demultiplexed). The point by @michael.ante is that 'grep' matches both the barcode in the header as well as any reads that contain that sequence string. Notice that the value returned by grep (~460M) is larger than the number of reads in your file reported by BBDuk (~440M). Those other grep matches aren't barcodes - they're from the reads.

ADD REPLY
1
Entering edit mode

BBDuk is doing fine. The barcode is in the read id line:

@ST-E00161:..... 1:N:0:TAAGGCGA

BBDuk used according to your output barcodefilter=t ; the number of input reads are equal to the number of output reads. Therefore, you should not have a barcode contamination.

ADD REPLY
0
Entering edit mode

Hello berghannaf ,

please use the ADD REPLY button below the post you like to reply to.

Thanks.

fin swimmer

ADD REPLY
0
Entering edit mode

All of the above comments are correct, my sample has already been demultiplexed and the barcode is fetched by grep in the sequence. hence, I don't need to use BBDuk2.sh to filter for my barcode. Thank you for your help! :):):)

ADD REPLY
1
Entering edit mode
6.0 years ago
h.mon 35k

If your library preparation used a standard Illumina protocol and kit, the barcodes are not part of the reads, and are sequenced separately from them. In general, Real-TimeAnalysis automatically demultiplexes the samples and separates files according to barcodes. One can set up the software to output the barcodes, then the result would be R1.fastq (and maybe R2.fastq, if paired-end) and I1.fastq (and maybe I2.fastq, if dual-index).

BBDuk.sh barcodes=TAAGGCGA barcodefilter=t will look for this barcode on the header of the sequences. As they have already been demultiplexed, each file contain only one barcode, and BBDuk outputs the same number of reads as the input.

If your barcodes are added to the library as part of a custom protocol, they would probably appear at a fixed position on your reads - likely at the beginning. The grep command you used searches for the pattern at any position on your file (including the fastq headers), so you are not checking for barcodes - for that, you would need to look for the pattern at the expected location, and only look at the reads.

edit: a further comment:

On your command grep TAAGGCGA R1.fastq | wc -w, wc -wcounts the number of words from the stream, so when grep matches a line from the header ( @ST-E00161:630:HTCFNCCXY:2:1101:2270:2012 1:N:0:TAAGGCGA ) it will produce a count of 2.

grep "TAAGGCGA" R1.fastq | wc -l would count the number of lines from the stream (and it would be equivalent of just grep -c "TAAGGCGA" R1.fastq. Both commands count only the number of lines with matches to the pattern, but not the total number of matches to the pattern - if the pattern is matched more than once in a line, only one occurrence will be counted.

ADD COMMENT
1
Entering edit mode

Another note: OP might need to use grep -w to capture only those sequences that are part of the header (or at the least, not part of the sequence part). grep $SEQ | wc -w will produce a result of 1 for each sequence line where $SEQ is a subsequence of the larger sequence spanning the entire line.

ADD REPLY

Login before adding your answer.

Traffic: 2027 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6