My concern here is the possibility of adapter dimers, where the 3'-adapter sequence is expected to be located at the 5'-end. If I use the "-b" option, then for these adapter dimers, as I understand it, cutadapt will treat the sequence as a 5' adapter instead of a 3' adapter and thus will NOT trim the sequence that follows the adapter. Is this correct?
Let's find out! The documentation for the "5' or 3' adapters" feature says:
The decision which part of the read to remove is made as follows: If there is at least one base before the found adapter, then the adapter is considered to be a 3' adapter and the adapter itself and everything following it is removed. Otherwise, the adapter is considered to be a 5' adapter and it is removed from the read, but the sequence after it remains.
So it sounds like it'll only be one end or the other, like you said. Checking with ATCCCGGATGTT as the adapter on one/the other/both ends of a random 50 nt sequence, that is the behavior I see:
$ cat input.fa
>seq1 adapter at 3'
TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGGATCCCGGATGTT
>seq2 adapter at 5'
ATCCCGGATGTTTCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGG
>seq3 adapter at both
ATCCCGGATGTTTCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGGATCCCGGATGTT
$ cutadapt --quiet -b ATCCCGGATGTT input.fa -o -
>seq1 (3' adapter removed)
TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGG
>seq2 (5' adapter removed)
TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGG
>seq3 (only 5' adapter removed)
TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGGATCCCGGATGTT
Like GenoMax said, wouldn't the most straightforward approach be to let the program remove the adapter (wherever it's found) and everything 3' of it? For that you just need cutadapt's regular -a flag for a typical 3' adapter:
$ cutadapt --quiet -a ATCCCGGATGTT input.fa -o -
>seq1
TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGG
>seq2
>seq3
Coupled with a length filter, you could remove the resulting empty sequences:
$ cutadapt --quiet -m 1 -a ATCCCGGATGTT input.fa -o -
>seq1
TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGG
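The same logic can be sketched in a few lines of Python: trim the adapter and everything 3' of it, then drop reads shorter than a minimum length. Again this is a simplified model with exact matching only, not how cutadapt is implemented, and the helper names trim_3prime and trim_and_filter are hypothetical.

```python
ADAPTER = "ATCCCGGATGTT"
INSERT = "TCGTGAGGCGGCACAAATTGCGCGAGGCAAGAGTATTAGAAGCCTACAGG"


def trim_3prime(read: str, adapter: str) -> str:
    """Remove the first exact adapter occurrence and everything after it."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def trim_and_filter(records: dict, adapter: str, min_length: int = 1) -> dict:
    """Trim every read, then keep only reads with at least min_length bases."""
    kept = {}
    for name, seq in records.items():
        trimmed = trim_3prime(seq, adapter)
        if len(trimmed) >= min_length:
            kept[name] = trimmed
    return kept


reads = {
    "seq1": INSERT + ADAPTER,            # adapter at 3'
    "seq2": ADAPTER + INSERT,            # adapter at 5' (dimer-like)
    "seq3": ADAPTER + INSERT + ADAPTER,  # adapter at both ends
}

for name, seq in trim_and_filter(reads, ADAPTER, min_length=1).items():
    print(f">{name}")
    print(seq)
```

Reads that begin with the adapter are trimmed down to nothing and then removed by the length filter, so only seq1 survives, as in the cutadapt output above.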
(See the filtering documentation for more options, like --discard-untrimmed.)
Generally, once a core adapter sequence is found, trimming programs will remove it along with everything 3' of it. So adapter dimers should be addressed that way.
You have tagged this smallRNAseq. Many kits use a specific adapter to ligate to small RNA. Unless that adapter is present (which would be on the 3'-end), you may not have a real small RNA.
If you are willing to try a different program, then I recommend bbduk.sh or fastp.