Hello,
I am trying to filter a FASTQ file. I ran FastQC to get a quality report, and it flags an overrepresented sequence:
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN count: 39317 percentage: 0.13862182817994162
The fastq file has 28362777 sequences and the read length is 125.
I used cutadapt to remove it:
gunzip -c SRR9667734_S_sp_2.fastq.gz | cutadapt -m 20 -e 0.1 -z -a NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN - -o SRR9667734_S_sp_cutadapt_2.fastq.gz
but the resulting file still has those overrepresented sequences, and the number of reads in the FASTQ file was reduced to 68122 after running cutadapt.
Overrepresented sequences:
sequence: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN count: 39317 percentage: 57.71556912597986
sequence: ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN count: 1172 percentage: 1.7204427350929214
sequence: GNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN count: 1014 percentage: 1.488505915856845
sequence: CNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN count: 895 percentage: 1.3138193241537244
sequence: TNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN count: 864 percentage: 1.268312733037785
Any idea of what's happening?
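(For reference, read counts like the ones above can be double-checked directly, assuming a standard 4-line-per-record FASTQ. The demo file below is made up for illustration; for the real data you would pipe `gunzip -c SRR9667734_S_sp_2.fastq.gz` into `wc -l` instead.)

```shell
# A FASTQ record is 4 lines (header, sequence, '+', quality),
# so number of reads = number of lines / 4.
# Hypothetical 2-read demo file standing in for the real data:
printf '@r1\nACGT\n+\nIIII\n@r2\nNNNN\n+\n!!!!\n' | gzip > demo.fastq.gz
reads=$(( $(gunzip -c demo.fastq.gz | wc -l) / 4 ))
echo "$reads"   # prints 2
```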
Not answering your question, but you can try bbduk.sh
from the BBMap suite with the maxns option to remove reads with Ns (from the BBDuk docs: "maxns=-1: if non-negative, reads with more Ns than this (after trimming) will be discarded").

For starters, maybe put the -o option before the input. And I'm pretty sure cutadapt can handle gzipped files, so there's no need to decompress first.
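Putting the two suggestions together, a sketch (not a tested pipeline): the commented cutadapt line shows the corrected argument order with direct gzip handling, and the awk filter below it is a dependency-free stand-in for bbduk's maxns filtering that drops the all-N reads seen in the FastQC report. The demo file is hypothetical.

```shell
# Corrected cutadapt call (sketch): -o comes before the input file, and
# cutadapt reads/writes .gz directly, so no gunzip pipe is needed:
#   cutadapt -m 20 -e 0.1 \
#       -a NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN \
#       -o SRR9667734_S_sp_cutadapt_2.fastq.gz SRR9667734_S_sp_2.fastq.gz
#
# Dependency-free stand-in for bbduk's maxns for this case: keep only
# 4-line FASTQ records whose sequence line is not entirely N.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nNNNNNNNN\n+\n!!!!!!!!\n' > demo.fastq
awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0}
     NR%4==0{if (s !~ /^N+$/) print h"\n"s"\n"p"\n"$0}' demo.fastq > demo_noN.fastq
```

The awk pass buffers each record's first three lines and, on the quality line, prints the record only if the sequence line contains at least one non-N base; here the all-N read r2 is dropped and only r1 survives.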