Question

Adapter contamination: Removing (not trimming) PE reads containing technical sequences

6.7 years ago

palu • 0

Dear community, I want to filter out reads containing technical sequences.

Case: Uploading an assembled genome to NCBI gives me error messages saying that technical sequences are still present. We identified these sequences (using the coordinates from NCBI) and created a multi-FASTA file with 38 sequences (including reverse complements). I trimmed my raw PE reads with Trimmomatic, but I think technical sequences are still present in some of the reads (possibly also in the middle of the reads).

Question: I am looking for a tool to which I can provide paired-end reads (file1.fastq, file2.fastq) and a multi-FASTA file containing various technical sequences. If a technical sequence is identified, the tool should remove the whole read (and also the corresponding paired read).

Thank you very much

sequencing Assembly • 1.8k views

GenoMax · Answer 1 · 2018-03-30

6.7 years ago

GenoMax 147k

I am not sure what technical sequences you are referring to, but you can use bbduk.sh from the BBMap suite to scan for and trim these sequences. BBMap includes sequencing_artifacts.fa.gz and adapters.fa, which contain adapters for all commercially available kits. You can always add any additional sequences you need to remove, in FASTA format, to one of these files. You can use the minlen= directive to discard reads that become shorter than a set length after trimming. Be sure to scan/trim your paired-end data files together (use the options tpe tbo in that case).
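As a rough illustration of the suggestion above, the sketch below assembles a bbduk.sh command line for paired-end trimming with tpe and tbo. The output file names, the reference file name technical_sequences.fa, and the minlen value are hypothetical, and the k=23 mink=11 hdist=1 ktrim=r settings are commonly used adapter-trimming values that are assumed here rather than taken from this thread. The function only builds the argument list; it does not run anything.

```python
import shlex


def build_bbduk_command(in1, in2, out1, out2, ref, minlen=None):
    """Assemble a bbduk.sh trimming command for paired-end reads (not executed)."""
    cmd = [
        "bbduk.sh",
        f"in1={in1}", f"in2={in2}",      # both mates are processed together
        f"out1={out1}", f"out2={out2}",
        f"ref={ref}",                    # FASTA file of sequences to look for
        "ktrim=r",                       # trim the match and everything to its right
        "k=23", "mink=11", "hdist=1",    # assumed k-mer settings, tune for your data
        "tpe", "tbo",                    # paired-end options mentioned in the answer
    ]
    if minlen is not None:
        # Discard reads that end up shorter than this after trimming.
        cmd.append(f"minlen={minlen}")
    return cmd


# Hypothetical file names and threshold, for illustration only.
cmd = build_bbduk_command(
    "file1.fastq", "file2.fastq",
    "clean1.fastq", "clean2.fastq",
    "technical_sequences.fa",
    minlen=100,
)
print(shlex.join(cmd))
```

The printed line can be inspected and adjusted before being run in a shell.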

Thanks @genomax. It seems that PCR primers are present in addition to adapters. Is there an alternative to removing reads based on minlen=? The problem is that my reads don't have a constant length but span a whole length distribution (see picture). I am looking for something like: if a technical sequence is present -> remove the read, regardless of length.

[Figure: read length distribution]

Reply written 6.7 years ago by palu
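To make the requested behaviour concrete ("if a technical sequence is present, remove the read and its mate, regardless of length"), here is a minimal Python sketch. It is not a tool named in this thread. It assumes plain 4-line FASTQ records, mates in the same order in both files, and exact substring matching, so a technical sequence carrying a sequencing error would slip through. All function names are hypothetical.

```python
from itertools import islice


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) from a 4-line-per-record FASTQ."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return
        yield tuple(record)


def contains_technical(seq, targets):
    """True if any target occurs as an exact substring of seq."""
    seq = seq.upper()
    return any(target in seq for target in targets)


def filter_pairs(reads1, reads2, technical_seqs):
    """Yield only the pairs in which neither mate contains a technical sequence."""
    # Search both orientations of every technical sequence.
    targets = set()
    for tech in technical_seqs:
        tech = tech.upper()
        targets.update((tech, reverse_complement(tech)))
    for r1, r2 in zip(reads1, reads2):
        if contains_technical(r1[1], targets) or contains_technical(r2[1], targets):
            continue  # drop the whole pair if either mate is contaminated
        yield r1, r2


# Tiny in-memory example; the technical sequence is a made-up placeholder.
technical = ["AGATCGGAAGAGC"]
mates1 = [
    ("@a/1", "ACGTACGTACGT", "+", "I" * 12),
    ("@b/1", "TTTTAGATCGGAAGAGCTTTT", "+", "I" * 21),
]
mates2 = [
    ("@a/2", "GGGGCCCCAAAA", "+", "I" * 12),
    ("@b/2", "CCCCCCCCCCCC", "+", "I" * 12),
]
kept = list(filter_pairs(mates1, mates2, technical))
print([pair[0][0] for pair in kept])
```

With real files, read_fastq can be given two open file handles and the surviving pairs written back out. For large data sets or tolerance to mismatches, a k-mer based tool is likely the better choice.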

If you really want to be stringent about removing any read that gets trimmed, decide on the number of bases you can afford to lose and then set minlen=(length of original read - # of bases you can lose). I suggest not doing "If technical sequence is present -> remove read, regardless of length", since you will lose a lot of good data that way.
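The arithmetic behind that minlen suggestion can be sketched in a few lines of Python. The read length of 150 and the allowance of 10 bases are made-up numbers, the function names are hypothetical, and the pair rule (drop both mates if either falls below the threshold) is an assumption about how the filter is applied.

```python
def minlen_for(original_length, max_bases_lost):
    """minlen = length of original read - number of bases you can afford to lose."""
    if not 0 <= max_bases_lost <= original_length:
        raise ValueError("max_bases_lost must be between 0 and original_length")
    return original_length - max_bases_lost


def pair_survives(trimmed_len1, trimmed_len2, minlen):
    """Assumed pair rule: keep the pair only if both mates are at least minlen long."""
    return trimmed_len1 >= minlen and trimmed_len2 >= minlen


threshold = minlen_for(150, 10)             # hypothetical 150 bp reads, 10 bases allowed
print(threshold)
print(pair_survives(150, 145, threshold))   # both mates lost at most 10 bases
print(pair_survives(150, 120, threshold))   # one mate lost 30 bases
```

Note that with a variable read-length distribution, as in the question, a single fixed threshold also discards reads that were already shorter than it before any trimming.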

Reply written 6.7 years ago by GenoMax