Dear community, I want to filter out reads containing technical sequences.
Case: Uploading an assembled genome to NCBI gives me error messages saying that technical sequences are still present. We identified these sequences (using the coordinates from NCBI) and created a multi-FASTA file with 38 sequences (including reverse complements). I trimmed my raw PE reads with Trimmomatic, but I think technical sequences are still present in some of the reads (maybe also in the middle of the reads).
Question: I am looking for a tool to which I can provide paired-end reads (file1.fastq, file2.fastq) and a multi-FASTA file containing various technical sequences. If a technical sequence is identified, the tool should remove the whole read (and also the corresponding paired read).
Thank you very much
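To make the requested behaviour concrete, here is a minimal Python sketch of it, not an existing tool. It assumes exact substring matching (no mismatches allowed), uncompressed FASTQ with four lines per record, and both files listing mates in the same order. The function names and the file name technical.fasta are made up for illustration.

```python
from itertools import islice


def read_fasta(path):
    """Return all sequences from a (multi-)FASTA file, uppercased."""
    seqs, chunks = [], []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous sequence, if any.
                if chunks:
                    seqs.append("".join(chunks).upper())
                    chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        seqs.append("".join(chunks).upper())
    return seqs


def fastq_records(handle):
    """Yield FASTQ records as lists of 4 raw lines (assumes 4-line records)."""
    while True:
        rec = list(islice(handle, 4))
        if len(rec) < 4:
            return
        yield rec


def has_technical(seq, technical):
    """True if any technical sequence occurs exactly anywhere in seq."""
    seq = seq.upper()
    return any(t in seq for t in technical)


def filter_pairs(r1_in, r2_in, r1_out, r2_out, technical):
    """Write only pairs in which neither mate contains a technical sequence."""
    kept = dropped = 0
    with open(r1_in) as a, open(r2_in) as b, \
            open(r1_out, "w") as out_a, open(r2_out, "w") as out_b:
        for rec1, rec2 in zip(fastq_records(a), fastq_records(b)):
            hit = (has_technical(rec1[1].strip(), technical)
                   or has_technical(rec2[1].strip(), technical))
            if hit:
                dropped += 1  # drop both mates together
                continue
            out_a.writelines(rec1)
            out_b.writelines(rec2)
            kept += 1
    return kept, dropped
```

It could be called as filter_pairs("file1.fastq", "file2.fastq", "clean1.fastq", "clean2.fastq", read_fasta("technical.fasta")). Because matching is exact, technical sequences with sequencing errors or only partial overlap at a read end would be missed, so this is a sketch of the logic rather than a replacement for a k-mer based filter.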
Thanks @genomax. It seems that PCR primers are present in addition to adapters. Is there an alternative to removing reads based on minlen=? The problem is that my reads don't have a constant length but a whole length distribution (see picture). I am looking for something like: If technical sequence is present -> remove read, regardless of length
length distribution
If you really want to be stringent about removing any read that gets trimmed, then decide on the (# of bases you can afford to lose) and set minlen=(length of original read - # of bases you can lose). I suggest not doing "If technical sequence is present -> remove read, regardless of length" since you will lose a lot of good data that way.
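The minlen arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes a single original read length, the numbers (150 and 10) are hypothetical, and the function names are made up rather than taken from any tool.

```python
def min_length(original_length, max_bases_lost):
    """minlen = length of original read - # of bases you can afford to lose."""
    if not 0 <= max_bases_lost <= original_length:
        raise ValueError("max_bases_lost must be between 0 and original_length")
    return original_length - max_bases_lost


def keep_pair(len1, len2, minlen):
    """Keep a pair only if both mates are still at least minlen after trimming."""
    return len1 >= minlen and len2 >= minlen


# Hypothetical example: 150 bp reads, at most 10 bases may be trimmed away.
minlen = min_length(150, 10)
print(minlen)                       # 140
print(keep_pair(150, 145, minlen))  # True: both mates are long enough
print(keep_pair(150, 120, minlen))  # False: one mate lost 30 bases
```

Setting max_bases_lost to 0 reproduces the strict "any trimmed read is removed" behaviour, while a larger value keeps reads that lost only a few bases.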