I have a sample from illumina I want to filter so that only the reads containing the barcode TAAGGCGA are kept in an output file. I know that this barcode exists and that I have contamination from other barcodes through this bit of code:
#Barode for COL02
$ grep TAAGGCGA R1.fastq | wc -w
462670619
Barcode for COL04
$ grep CGTACTAG R1.fastq | wc -w
23463
#Barcode for COL24
$ grep TCGACGTC R1.fastq | wc -w
37567
However, when I use BBDuk2.sh to filter on the barcode from COL02, no reads are removed and the output files are exactly the same size as the input files. What am I doing wrong?
Here is my input code and the terminal output, I have also tried with BBDuk.sh, I knw BBDuk2.sh is for piping.
dell@dell-Precision-7920-Tower ~
$ /home/dell/proj/tools/bbmap/bbduk2.sh
in=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R1.fastq
in2=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R2.fastq
out=/home/dell/proj/tools/bbmap/output/c1_R1.fastq
out2=/home/dell/proj/tools/bbmap/output/c1_R2.fastq
barcodes=TAAGGCGA barcodefilter=t
-Xmx68g overwrite=TRUE stats=/home/dell/proj/tools/bbmap/output/c1_stats prealloc=t
java -Djava.library.path=/home/dell/proj/tools/bbmap/jni/ -ea -Xmx68g -Xms68g -cp /home/dell/proj/tools/bbmap/current/ jgi.BBDuk2
in=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R1.fastq
in2=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R2.fastq
out=/home/dell/proj/tools/bbmap/output/c1_R1.fastq
out2=/home/dell/proj/tools/bbmap/output/c1_R2.fastq barcodes=TAAGGCGA
barcodefilter=t -Xmx68g overwrite=TRUE stats=/home/dell/proj/tools/bbmap/output/c1_stats prealloc=t
Executing jgi.BBDuk2 [in=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R1.fastq,
in2=/home/dell/proj/data/Sample_R18049331-180719CT-180719CP1/R2.fastq,
out=/home/dell/proj/tools/bbmap/output/c1_R1.fastq, out2=/home/dell/proj/tools/bbmap/output/c1_R2.fastq,
barcodes=TAAGGCGA, barcodefilter=t, -Xmx68g, overwrite=TRUE,
stats=/home/dell/proj/tools/bbmap/output/c1_stats, prealloc=t]
Version 38.32
Note - if this program runs out of memory, please disable the prealloc flag.
Initial size set to 676157570
NOTE: No reference files specified, no trimming mode, no min avg quality, no histograms - read sequences will not be changed.
Initial:
Memory: max=73014m, total=73014m, free=72930m, used=84m
Input is being processed as paired
Started output streams: 0.012 seconds.
Processing time: 2244.830 seconds.
Input: 440181546 reads 66027231900 bases.
Low quality discards: 0 reads (0.00%) 0 bases (0.00%)
Total Removed: 0 reads (0.00%) 0 bases (0.00%)
Result: 440181546 reads (100.00%) 66027231900 bases (100.00%)
Time: 2244.955 seconds.
Reads Processed: 440m 196.08k reads/sec
Bases Processed: 66027m 29.41m bases/sec
The head of my fastq file sequences looks like this:
Could it be that you are counting the existence of the barcode sequence in the read-sequence?
Try to grep "1:N:0:CGTACTAG" , or use awk to get only the first of the four lines (
awk 'NR%4==1 {print}'
) .Cheers,
Michael
hm, well, yes with grep I want to count the existance of the barcode sequence. It's just my way of checking that the barcode actually exists and that I should filter on barcode for col02, rather than filtering on any of the rest of the barcodes for other columns. I base that on the number of reads with that barcode.
But BBDuk should be able to actually extract the reads with that specific barcode to a new file. (I guess I could do it myself in bash also, but I assumed that BBDuk would be faster at it.) :) :)
Your sample has already been filtered (demultiplexed). The point by @michael.ante is that 'grep' matches both the barcode in the header as well as any reads that contain that sequence string. Notice that the value returned by grep (~460M) is larger than the number of reads in your file reported by BBDuk (~440M). Those other grep matches aren't barcodes - they're from the reads.
BBDuk is doing fine. The barcode is in the read id line:
BBDuk used according to your output barcodefilter=t ; the number of input reads are equal to the number of output reads. Therefore, you should not have a barcode contamination.
Hello berghannaf ,
please use the
ADD REPLY
button below the post you like to reply to.Thanks.
fin swimmer
All of the above comments are correct, my sample has already been demultiplexed and the barcode is fetched by grep in the sequence. hence, I don't need to use BBDuk2.sh to filter for my barcode. Thank you for your help! :):):)