I'm trying to trim/filter low quality reads from paired-end exome-seq data, using BBDuk.
I used the command:
for R1 in $files; do
    # Derive the mate (R2) file name from the R1 file name (bash substitution)
    R2="${R1/R1/R2}"
    /home/shared/programs/bbmap/bbduk.sh -Xmx1g in1="$R1" in2="$R2" \
        out1="${R1%.fastq.gz}_trimmed_filtered.fastq.gz" \
        out2="${R2%.fastq.gz}_trimmed_filtered.fastq.gz" \
        ref=/home/shared/programs/bbmap/resources/adapters.fa \
        t=10 ktrim=r k=23 kmin=11 hdist=1 maq=10 minlen=60 tpe tbo
done
After running FastQC on the output, I see that the R2 files still contain some reads with low quality scores (see the per sequence quality score plot), as well as the overrepresented sequence "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN". Shouldn't these have been removed by the quality filtering?
Any help here would be much appreciated.
Take a look at the quality scores associated with those reads. Perhaps they contain other bases that satisfy the maq=10 criterion. Are those pure N's? You may want to use this to filter out reads with N's.
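To check both points (whether the reads are pure N's, and what quality they carry), something like the following could work. It is only a sketch: the function name all_n_read_quality is made up for illustration, and it assumes plain 4-line FASTQ records with Phred+33 quality encoding.

```shell
# Print the name and mean Phred score of every read whose sequence is only N's.
# Assumes 4-line FASTQ records and Phred+33 encoding (both are assumptions).
all_n_read_quality() {
  awk '
    # Lookup table: printable character -> ASCII code
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 1 { name = $0 }                      # header line
    NR % 4 == 2 { is_all_n = ($0 ~ /^N+$/) }       # sequence line
    NR % 4 == 0 && is_all_n {                      # quality line
      sum = 0
      for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
      printf "%s\t%.1f\n", name, sum / length($0)
    }
  '
}

# Usage (hypothetical file name):
# gzip -dc sample_R2_trimmed_filtered.fastq.gz | all_n_read_quality | head
```

If every all-N read reports a mean of 0.0, the N's are not being carried along by other high-quality bases in the same read.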
Will try that, thanks.
Even so, though, shouldn't the 'per sequence quality score' plot be empty for any quality score <10, given the use of 'maq=10'?
Edit: Looking at these sequences in the fastq:
...which would suggest these sequences have an average quality score of 0.
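For a rough count of how many such reads survive in an output file, a sketch along these lines may help. The function name count_low_quality_reads is hypothetical, and it assumes 4-line FASTQ records, Phred+33 encoding, and a plain arithmetic mean of the per-base Phred scores; BBDuk may compute its average for maq differently, so the numbers need not match its filter exactly.

```shell
# Count reads whose arithmetic mean Phred score is below a threshold (default 10).
# Assumes 4-line FASTQ records and Phred+33 encoding (both are assumptions).
count_low_quality_reads() {
  awk -v min_q="${1:-10}" '
    # Lookup table: printable character -> ASCII code
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    NR % 4 == 0 && length($0) > 0 {                # quality line
      sum = 0
      for (i = 1; i <= length($0); i++) sum += ord[substr($0, i, 1)] - 33
      if (sum / length($0) < min_q) low++
    }
    END { printf "%d\n", low }
  '
}

# Usage (hypothetical file name):
# gzip -dc sample_R2_trimmed_filtered.fastq.gz | count_low_quality_reads 10
```

A non-zero count here would confirm that reads with a mean score below 10 are present in the filtered output, independently of the FastQC plot.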