Hello,
I'm using cutadapt to trim adapter sequences from a small RNA-seq dataset. However, I'm getting a lot of very short reads after trimming, with around 35% being 0-length reads. I'm using the following command:
cutadapt --discard-untrimmed -O 7 --minimum-length=18 --maximum-length=40 -a AGATCGGAAGAGC file.fastq > trimmed_file.fastq
With these settings I'm losing almost half of my data after trimming because the reads become too short (<18). Am I doing something wrong? Is it possible to get reads without inserts (only adapter sequenced)?
Maybe I'm using the wrong adapter sequence, but from what I read on forums, the sequence 'AGATCGGAAGAGC' should be able to trim all adapters from Illumina sequencing. Or am I wrong?
Thanks in advance.
PS: I have tried other overlap settings (3, 5 and 6) and the results are the same.
Try "Trim Galore!" with the default settings. If you get similar results then you did everything right (Trim Galore! is a wrapper around cutadapt).
BTW, you have FastQC as a tag, so if you have a lot of adapter contamination (likely the cause) then it'll show up there.
I have tried Trim Galore! with default settings and the results are the same. Running FastQC shows that there are a lot of overrepresented sequences, including "TruSeq Adapter, Index 7" and "Illumina Multiplexing PCR Primer 2.01". After trimming these sequences are gone, but there are still a lot of overrepresented sequences.
"Is it possible to get reads without inserts (only adapter sequenced)?"
Yes, and it's more common with small RNA libraries b/c purification by size selection is less effective. Insert-positive and insert-negative clones are similar in size and therefore difficult to resolve.
Follow Devon Ryan's recommendation for FastQC to detect adapter contamination. If it's 35%, that will be clearly visible in the per-cycle base graph and also flagged as over-represented kmers.
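As a rough cross-check alongside FastQC, the fraction of insert-less reads can be estimated directly from the raw FASTQ. The snippet below is only a minimal sketch: it looks for an exact match of the adapter at the very start of each read, whereas cutadapt tolerates mismatches, so treat the result as a lower bound. The function names and the `max_insert` parameter are made up for illustration.

```python
ADAPTER = "AGATCGGAAGAGC"  # adapter sequence from the question


def read_sequences(fastq_lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def adapter_only_fraction(fastq_lines, adapter=ADAPTER, max_insert=0):
    """Fraction of reads whose first exact adapter match starts at or before
    position max_insert, i.e. reads with (almost) no insert.

    Exact matching only: this is an assumption for simplicity and does not
    reproduce cutadapt's error-tolerant alignment.
    """
    total = 0
    empty = 0
    for seq in read_sequences(fastq_lines):
        total += 1
        pos = seq.find(adapter)
        if pos != -1 and pos <= max_insert:
            empty += 1
    return empty / total if total else 0.0


# Tiny in-memory example: two adapter-only reads, two reads with an insert.
example = [
    "@r1", "AGATCGGAAGAGCACACGTC", "+", "I" * 20,
    "@r2", "TGAGGTAGTAGGTTGTATAGTTAGATCGGAAGAGC", "+", "I" * 35,
    "@r3", "AGATCGGAAGAGCGTCGTGT", "+", "I" * 20,
    "@r4", "TAGCTTATCAGACTGATGTTGAAGATCGGAAGAGC", "+", "I" * 35,
]
print(adapter_only_fraction(example))
```

On a real file, something like `adapter_only_fraction(open("file.fastq"))` would give the estimate for the untrimmed data; if it comes out near 35%, that would be consistent with the 0-length reads being adapter dimers.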
When I trim, adapters are found in 94% of the reads, which I suppose is normal when dealing with small RNA-seq, since the inserts are shorter than the read length. So I guess those are probably "empty" reads and the only thing I can do is discard them, right?
Also, after trimming I'm getting reads of different lengths, with peaks at 18 and 33. I know that most miRNAs range from 20-23 nt and other sRNAs peak around 35. Is it possible that those 18 nt reads are miRNAs?
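For anyone who wants to look at the same thing, here is a minimal sketch of how the length distribution could be tabulated from a trimmed FASTQ. It assumes the trimming was run without `--minimum-length`/`--maximum-length`, so that the very short and 0-length reads are still in the file; the function names are just illustrative.

```python
from collections import Counter


def length_distribution(fastq_lines):
    """Count reads per length, using the sequence line of each 4-line record."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


def fraction_in_range(counts, low, high):
    """Fraction of reads whose length lies in [low, high] (inclusive)."""
    total = sum(counts.values())
    kept = sum(n for length, n in counts.items() if low <= length <= high)
    return kept / total if total else 0.0


# Tiny in-memory example: read lengths 0, 18, 22 and 33.
trimmed = [
    "@r1", "", "+", "",
    "@r2", "A" * 18, "+", "I" * 18,
    "@r3", "C" * 22, "+", "I" * 22,
    "@r4", "G" * 33, "+", "I" * 33,
]
counts = length_distribution(trimmed)
for length in sorted(counts):
    print(length, counts[length])
print(fraction_in_range(counts, 18, 40))
```

Calling `fraction_in_range(counts, 18, 40)` mirrors the 18-40 window from the command above, so it shows how much of the data would survive the length filter.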