Find all sequences with adapters without trimming them off
3
1
Entering edit mode
4 months ago
Freyda ▴ 10

I have a lot of sequences in a file that are all 8bp long. I only want the sequences that resemble GTNNNNTG. I thought about using cutadapt to filter all sequences with the linked adapter option:

cutadapt -a ^GT...TG --discard-untrimmed

However this removes all the GT and TG bases from each end but I want to keep them on the reads so I end up with a file with 8bp long sequences that all match GTNNNNTG.

Anyone have any other ideas how I could do this?

cutadapt fastq • 1.1k views
ADD COMMENT
4
Entering edit mode
4 months ago
Jesse ▴ 850

Would the "retain" action (or possibly "none" would be equivalent here) do what you want when combined with the --discard-untrimmed?

With --action=retain, the read is trimmed, but the adapter sequence itself is not removed. Up- or downstream sequences are removed in the same way as for the trim action. For linked adapters, both adapter sequences are kept.

ADD COMMENT
0
Entering edit mode

Yes this does the job! Thanks so much

ADD REPLY
0
Entering edit mode
4 months ago

I think you can do it by setting the --info-file option and then parsing the resulting file to get the reads of interest, see info file format. To get fastq format from the info-file should be straightforward. If I understand your question correctly, from the info file you want columns:

1 Read name

6 Sequence of the read that was matched to the adapter

10 Quality values corresponding to sequence matched to the adapter (can be empty)

(Make sure you don't need to reverse complement)

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion. Unfortunately, it looks like --info-file cannot be used with linked adapters :(

ADD REPLY
0
Entering edit mode

Can you explain what you mean by "linked adapters" and what error (if any) you get from cutadapt?

ADD REPLY
0
Entering edit mode

Linked adapters is what cutadapt calls it when you use a 5' and 3' adapter at the same time (so in my case GT is the 5' adapter and TG is the 3' adapter (represented by GT...TG given as an option).

I didn't run it but I read the manual and it says: Linked adapters do not work in combination with --info-file, --action=mask and --action=crop.

I guess I could do it twice, once for each "adapter" but I was hoping for a simpler solution

ADD REPLY
0
Entering edit mode

It would help if you posted an example with a handful of reads showing what it is you are trying to do. Just edit your question to add a few input fastq reads and the output you expect.

ADD REPLY
0
Entering edit mode
4 months ago

With BBTools, you can do this:

bbduk.sh in=reads.fastq out=unmatched.fastq outm=matched.fastq literal=GTNNNNTG copyundefined k=8 mm=f

However, your motif is not specific enough to be useful. It matched 77.05% of read pairs in some random data I'm playing with that have nothing to do with your adapters. That said, if you know that, for example, they only occur in the first 8bp of read 1, you could add "restrictleft=8 skipr2", which cuts it down to a 0.65% false-positive rate.

ADD COMMENT
0
Entering edit mode

But in the original question here everything's 8 nt long, so all else being equal for GTNNNNTG you'd expect it by chance only 1/4^4 = 1/256 = 0.4% of the time... or roughly the 0.65% you got for "restrictleft=8 skipr2", actually, so that all fits.

ADD REPLY
1
Entering edit mode

Actually, I forgot to add the flag "rcomp=f"; the command I gave will also match "CANNNNAC" which is probably why the number is a little high.

ADD REPLY

Login before adding your answer.

Traffic: 3385 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6