Question

Trimming out primer sequences in the middle of reads

0

7.7 years ago

s.kyungyong64 ▴ 40

Hi!

I have PacBio reads that need to be assembled. These reads have Illumina primers at both ends as well as in the middle. The problem is that the primer sequences vary, and standard trimming cannot remove all the primers from the reads. My lab wants the best-quality assembly possible, so I might have to write a script to detect the primers in the middle. I am currently thinking of removing sequences that are 80-100% similar to the primer sequences, but I am worried that this would also remove some informative genomic sequence.

How do you guys deal with such situations?

Thank you in advance!

genome • 4.9k views

1

7.7 years ago

dariober 15k

I don't have direct experience with the situation you describe, but cutadapt is very flexible in how it detects, removes, or masks one or more adapters. See, for example, the section on multiple adapter occurrences within a single read: https://cutadapt.readthedocs.io/en/stable/guide.html#multiple-adapter-occurrences-within-a-single-read

If the adapter sequence you provide as input is long enough, say > 15 nt, it's unlikely that you will throw away informative sequence (roughly speaking, of course).
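To put a rough number on the length argument, here is a back-of-the-envelope sketch in Python. It assumes a random genome with equal base frequencies and counts exact matches only; allowing mismatches or indels raises the numbers. The function name `expected_chance_hits` and the 3 Gb genome size are illustrative assumptions, not something taken from cutadapt.

```python
def expected_chance_hits(adapter_len: int, genome_size: int,
                         both_strands: bool = True) -> float:
    """Expected number of exact chance matches of one fixed sequence.

    Assumes independent, equally frequent bases (a simplification;
    real genomes are repetitive and compositionally biased).
    """
    # Number of windows of length adapter_len on one strand
    positions = max(genome_size - adapter_len + 1, 0)
    strands = 2 if both_strands else 1
    # Each window matches a fixed sequence with probability 1 / 4^k
    return strands * positions / 4 ** adapter_len


if __name__ == "__main__":
    genome_size = 3_000_000_000  # hypothetical genome size, for illustration
    for k in (10, 15, 20, 25):
        hits = expected_chance_hits(k, genome_size)
        print(f"{k} nt adapter: ~{hits:.3g} expected chance matches")
```

The expectation drops by a factor of 4 for every extra base, which is why longer adapter sequences are much safer to search for.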

score 2 · Accepted Answer · 2017-02-11

2

7.7 years ago

Brian Bushnell 20k

I wrote a tool for removing internal PacBio adapter sequences, in the BBMap package:

removesmartbell.sh in=reads.fq out=clean.fq split=t adapter=ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT

By default it uses the standard PacBio SMRTbell adapters, but you can specify an Illumina adapter in this case. It uses indel-aware alignment designed to model PacBio's indel and substitution error rates, and it has a very low false-positive rate. I don't remember the exact rate, but I think it was around 1 in 5 megabases of PacBio sequence, or something like that. So it should not cause any problems downstream.
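For anyone who wants to see the general idea of indel-aware adapter detection and read splitting, here is a minimal Python sketch. It is not the algorithm used by the BBMap tool: it swaps in a plain unit-cost edit-distance search (Sellers-style semi-global alignment) with no PacBio-specific error model. The function names and the 20% error threshold are assumptions for illustration, and pure Python like this would be far too slow for a real data set.

```python
def find_adapter_hits(read, adapter, max_error_rate=0.2):
    """Return sorted, non-overlapping (start, end, edits) adapter matches.

    The whole adapter is aligned against any substring of the read,
    with substitutions, insertions and deletions each costing 1.
    """
    m = len(adapter)
    max_edits = int(max_error_rate * m)
    # prev[i] = (edits, start) for adapter[:i] aligned so that it ends
    # at the previous read position
    prev = [(i, 0) for i in range(m + 1)]
    candidates = []
    for j in range(1, len(read) + 1):
        cur = [(0, j)]  # an alignment may start at any read position for free
        for i in range(1, m + 1):
            cost = 0 if adapter[i - 1] == read[j - 1] else 1
            diag = (prev[i - 1][0] + cost, prev[i - 1][1])  # match or mismatch
            up = (cur[i - 1][0] + 1, cur[i - 1][1])         # adapter base missing in read
            left = (prev[i][0] + 1, prev[i][1])             # extra base in read
            cur.append(min(diag, up, left))
        edits, start = cur[m]
        if edits <= max_edits and j > start:
            candidates.append((edits, start, j))
        prev = cur
    # Greedily keep the best-scoring hits that do not overlap a kept hit
    chosen = []
    for edits, start, end in sorted(candidates):
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, edits))
    return sorted(chosen)


def split_read(read, adapter, max_error_rate=0.2, min_len=1):
    """Cut the read at every adapter hit and return the remaining pieces."""
    pieces, pos = [], 0
    for start, end, _ in find_adapter_hits(read, adapter, max_error_rate):
        if start - pos >= min_len:
            pieces.append(read[pos:start])
        pos = end
    if len(read) - pos >= min_len:
        pieces.append(read[pos:])
    return pieces


if __name__ == "__main__":
    adapter = "ACGTACGTAC"  # toy adapter, not a real primer sequence
    read = "TTTTTTTT" + "ACGTACTAC" + "GGGGGGGG"  # adapter copy with one deleted base
    print(split_read(read, adapter))
```

Splitting rather than masking matters for assembly, because the sequence on either side of an internal adapter may not be contiguous in the genome.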

0

Hello, does your script also remove the reverse complement? Can I find it among the BBMap scripts?

Reply by ricardoguerreiro2121 ▴ 80, 4.8 years ago

1

You can include the RC sequence in the adapter file or on the command line above.

removesmartbell.sh in=reads.fq out=clean.fq split=t adapter=ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT,RC_Sequence
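If you need to generate the RC sequence yourself, a small Python helper along these lines should do. The function name `reverse_complement` is just an illustration, and the comma-separated `adapter=` value it prints simply follows the form shown in the command above.

```python
# Translation table for complementing DNA bases (N stays N, case is kept)
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    adapter = "ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT"
    # Build the comma-separated value for the adapter= parameter
    print(f"adapter={adapter},{reverse_complement(adapter)}")
```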

Reply by GenoMax 146k, 4.8 years ago