Removing adapter/primer sequence from reads while keeping flanking regions
3
0
Entering edit mode
8 months ago
Bryan ▴ 10

hi folks,

I have some metagenomics data that was generated using an amplification protocol that integrated a known primer sequence into random sites throughout genomes in the sample. I'd like to selectively remove the primer sequence and keep all flanking regions of the reads, a la deleting the primer sequence, but not trimming the read, which is done separately.

An example:

Raw read

sequence1ADAPTERsequence2

Desired result:

sequence1sequence2

I haven't figured out a way to do this with cutadapt (though would love any advice I've missed) and I know BBDuk can mask the adapter sequence with Ns, but that isn't exactly what I'm looking for either. Any suggestions welcome!

thanks!

trimming fasta cutadapt BBDuk fastq • 1.2k views
ADD COMMENT
0
Entering edit mode

If you have input fastq files, you should be able to do using cutadapt-

cutadapt -a ADAPTER -o output.fastq input.fastq

Reference

ADD REPLY
0
Entering edit mode

This ADAPTER is in the middle of the read. OP wants to remove that and join the left and right pieces.

ADD REPLY
0
Entering edit mode

See if this helps: https://bioinf.shenwei.me/seqkit/usage/#amplicon

No standard scan/trim program is likely going to do this. Generally all sequence 3' of where the initial match is made to adapter is trimmed.

ADD REPLY
1
Entering edit mode
8 months ago

seqkit amplicon can only keep either 5' or 3' flanking sequence.

How about

seqkit replace -s -p ADAPTER

Assuming there's only one adapter sequence in each read.

$ echo -ne ">s\nsequence1ADAPTERsequence2\n"
>s
sequence1ADAPTERsequence2
$ echo -ne ">s\nsequence1ADAPTERsequence2\n" \
    | seqkit replace -s -p ADAPTER
>s
sequence1sequence2
ADD COMMENT
0
Entering edit mode

Will seqkit replace do the same kind of fuzzy matching that cutadapt will? At a glance, it doesn't look like it.

ADD REPLY
0
Entering edit mode

No fuzzy matching, this command only supports regular expression.

ADD REPLY
0
Entering edit mode

shenwei356 This should work if the adapter sequence is constant? If OP confirms then we can move this to an answer.

ADD REPLY
0
Entering edit mode
8 months ago

It's kind of clunky, but you could use cutadapt to remove the adapter and everything after, save that to a fastq, use cutadapt to remove the adapter and everything before, save that to a second fastq, then join the fastqs together. (check for matching names, since the two files will probably not be in sync). Not sure what program will join fastqs like that, but you could write a script in something to do that.

ADD COMMENT
0
Entering edit mode
8 months ago
Bryan ▴ 10

thanks for the speedy replies, all! I ended up just resorting to my roots with sed, which produces nearly identical results to shenwei356 seqkit replace answer. As neither supports degeneracy, I'm not sure that either stands out as more optimal, aside for some efficiency concerns, which I didn't evaluate.

$ echo -ne ">s\nsequence1ADAPTERsequence2\n" \
> | sed 's/ADAPTER//g'
>s
sequence1sequence2

I will note that this sed command performs global substitution, so it will replace multiple ADAPTER instances per read, if found, whereas it looks like shenwei356 answer will remove a single instance, in case that matters for folks referring to this in the future.

ADD COMMENT
0
Entering edit mode

I moved your and Shenwei's comments to an answer. If you want to accept both to provide closure (green check mark) to this thread that would be great.

While sed is fine when you work with defined format like fastq there may be unexpected glitches with sed since it does not understand fastq format. shenwei356 's tool on the other hand does.

Is seq1ADAPTERseq2ADAPTERseq3 an edge case or do you actually expect to see that kind of construct. Because if not this is an erroneous read and should be thrown away.

ADD REPLY
0
Entering edit mode

thanks for cleaning this up, GenoMax . It's possible to have a situation like seq1ADAPTERseq2ADAPTERseq3, but that would probably only occur in a pretty low biomass sample for this protocol and would need some extra QC anyway. I accepted shenwei356 answer since seqkit does understand fasta/q as you pointed out. Thanks all!

ADD REPLY

Login before adding your answer.

Traffic: 4277 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6