Question

Trimmomatic or Cutadapt not Trimming my Adapters even though I see them clearly in the Fastq-File

7 months ago

obsto123 • 0

Dear Biostars community,

I am a Master's Student and trying to process my first RNA-Seq Data.

After inspecting my Fastq Files, I noticed that almost 90% of Reads showed complete read-through followed by a long Poly-G stretch, which is apparently specific to the NextSeq Sequencer.

I first did:

cutadapt --nextseq-trim=30 --minimum-length=20

which got rid of most of the Poly-Gs but not the Adapters at the Ends. It also significantly improved the Quality Score.

Then I did:

cutadapt -b ADAPTER_REV_REVCOM -B ADAPTER_FWD_REVCOM --minimum-length=20

I double-checked the Adapter Sequences by running grep on the raw Fastq data, and they are correctly assigned to Read1 (fwd) and Read2 (rev) in the command.
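For anyone double-checking which orientation to pass, a reverse complement can be computed with standard tools. This is only a minimal sketch: `revcomp` is a hypothetical helper name, and it assumes `rev` and `tr` are installed and that the sequence contains only A, C, G, T and N.

```shell
# Hypothetical helper: print the reverse complement of a DNA sequence.
# Assumes only A, C, G, T, N (upper or lower case) in the input.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The last 13 bases of Prefix/1 in the contaminant file are GCTCTTCCGATCT;
# their reverse complement is the first 13 bases of Prefix/2.
revcomp GCTCTTCCGATCT   # prints AGATCGGAAGAGC
```

Grepping the raw reads for both orientations shows which one actually occurs in Read1 and which in Read2.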

The thing is, this only got rid of the Adapters from Read2, but the Adapters on Read1 were all still there. I then checked the fastq files which were created using:

cat <file.fastq> | awk 'NR%4==2' | grep <(Partial) Adapter Sequence I supplied in the cmd line>

and I got more than 100k Matches.
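The same check can be wrapped in a small function so it is easy to repeat on the raw and trimmed files and compare the counts. This is a sketch under assumptions: `count_adapter_reads` is a hypothetical name, the FASTQ file is uncompressed with exactly four lines per record, and the adapter fragment is matched as a fixed string without mismatches.

```shell
# Hypothetical helper: count reads whose sequence line contains a given string.
# $1 = uncompressed FASTQ file (4 lines per record)
# $2 = adapter fragment, matched literally (no mismatches allowed)
count_adapter_reads() {
    # Keep only the sequence line of each record, then count matching lines.
    # "|| true" keeps the exit status clean when grep finds nothing (prints 0).
    awk 'NR % 4 == 2' "$1" | grep -c -F -- "$2" || true
}

# Example call (file name is a placeholder):
# count_adapter_reads trimmed_R1.fastq AGATCGGAAGAGC
```

Running it before and after trimming with the same fragment gives a quick measure of how many adapter-containing reads were actually changed.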

I also subsequently ran trimmomatic ILLUMINACLIP:contam_file.fa:2:30:10 and the Adapter Sequences still remained. I created the contam_file.fa myself; it looks like this:

>Prefix/1
AATGATACGGCGACCACCGAGATCTACACGTAAGGAGACACTCTTTCCCTACACGACGCTCTTCCGATCT
>Prefix/2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTCTCCTTACGTGTAGATCTCGGTGGTCGCCGTATCATT
>Prefix/3
CAAGCAGAAGACGGCATACGAGATGTTCAACCGTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>Prefix/4
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGTTGAACATCTCGTATGCCGTCTTCTGCTTG

To note, the Adapter Sequences I find in my Fastq Files are not complete, full-length matches of the Sequences I supplied in the cutadapt cmd line, but they overlap by at least 70% and there are close to no mismatches in the alignments. I thought cutadapt uses a default minimum overlap of 3, so this shouldn't be the issue. The default maximum error rate of 0.1 (mismatches relative to the matched length) should not be an issue either.
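As a rough sanity check on the error tolerance, the number of tolerated mismatches for a partial match can be estimated. The sketch below assumes that allowed errors are the matched length times the error rate, rounded down, which is how I understand the cutadapt documentation; `allowed_errors` is a hypothetical helper, and the rate is given in percent to stay in integer arithmetic.

```shell
# Hypothetical helper: errors tolerated for an adapter match of a given length,
# ASSUMING allowed errors = floor(match_length * error_rate).
# $1 = length of the adapter match
# $2 = maximum error rate in percent (default 10, i.e. 0.1)
allowed_errors() {
    echo $(( $1 * ${2:-10} / 100 ))
}

allowed_errors 9    # prints 0
allowed_errors 33   # prints 3
```

Under that assumption, matches shorter than 10 bases must be exact, while a 33-base match tolerates up to three errors.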

I hope somebody can help me! Thank you and have a nice day :)

rna-seq cutadapt trimmomatic • 1.6k views

updated 7 months ago by Brian Bushnell 20k • written 7 months ago by obsto123 • 0

Sorry for the odd formatting of the contents of "contam_file.fa". The file itself is in the correct format; it somehow got mangled while copy-pasting.

7 months ago by obsto123 • 0

Your file appears to be in correct multi-fasta format.

Please check out two other popular tools that have easy-to-understand options:

fastp - https://github.com/OpenGene/fastp?tab=readme-ov-file#simple-usage
bbduk.sh from BBMap suite : Guide here https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/

7 months ago by GenoMax 151k

score 1 · Answer 1 · 2024-09-08

I am assuming you have included -o and -p options for paired end sequencing. You should always post the full command you use.

Your partial command:

cutadapt -b ADAPTER_REV_REVCOM -B ADAPTER_FWD_REVCOM --minimum-length=20

Try -a and -A instead of -b and -B

Here are the options I use:

cutadapt -j 15 --quiet -m 25 -q 25 -a file:$adapter -A file:$adapter -o $R1_out -p $R2_out $R1_in $R2_in
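For context on the suggestion above: -a/-A declare a 3' adapter, while -b/-B allow the adapter at either end of the read. A single paired-end run might look like the sketch below. The adapter strings are the first 33 bases of Prefix/4 and Prefix/2 from the contaminant file in the question; which one appears in Read1 and which in Read2 is an assumption here (a TruSeq-style library) and should be confirmed with grep. `trim_pair`, the `CUTADAPT` override and all file names are hypothetical.

```shell
# Sketch of one paired-end cutadapt run; not a definitive recipe.
# $1/$2 = input R1/R2, $3/$4 = output R1/R2 (all placeholders).
# ASSUMPTION: Read1 runs into Prefix/4 and Read2 into Prefix/2.
# Set CUTADAPT=echo to print the arguments instead of running cutadapt.
trim_pair() {
    "${CUTADAPT:-cutadapt}" \
        --nextseq-trim=30 \
        -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
        -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
        --minimum-length=20 \
        -o "$3" -p "$4" \
        "$1" "$2"
}

# Example call:
# trim_pair raw_R1.fastq.gz raw_R2.fastq.gz trimmed_R1.fastq.gz trimmed_R2.fastq.gz
```

The --nextseq-trim and --minimum-length values are the ones from the question; doing the poly-G trimming and adapter trimming in one invocation avoids writing an intermediate file.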

What does MultiQC report under over-represented sequences?

score 0 · Answer 2 · 2024-09-19

I just made a big post about Illumina poly-Gs here:

New Illumina error mode, new BBTools release (39.09) to deal with it

It's slightly different from your problem since that post was specific to NovaSeqX and you are posting about NextSeq, which does not have this specific problem. However, the remedies will still work since they are designed generically.

Most of the time (prior to Illumina's big new screw-up with NovaSeqX), artificial poly-Gs on Illumina 2-dye systems are due to short-insert reads. Once the sequencer reads off the end of the genomic insert, and then off the end of the adapter, there's just nothing, and that is correctly interpreted as G (a dark cycle on some Illumina 2-dye sequencers). So the first thing you need to do is adapter-trimming. You can do that with BBTools like this:

bbduk.sh in=reads.fq out=clean.fq ktrim=r k=23 mink=11 hdist=2 hdist2=0 ref=adapters tbo tpe

That should get rid of the majority of your poly-Gs. If that does not totally clean up your data, please look at the thread I posted above.

P.S. For people who don't want to follow the link (which I recommend, but it is kind of long) you can do one of these things:

bbduk.sh in=reads.fq out=clean.fq k=29 hdist=2 literal=GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG trimpolyg=6

and/or

polyfilter.sh in=reads.fq out=clean.fq

If you do both of them, do polyfilter first, since trimming messes up the poly-G detection. And be sure to use BBTools 39.09+ because BBDuk's poly-G trimming was improved there to ensure it trims poly-Gs that have intermittent non-G bases. For NextSeq that shouldn't matter too much but for NovaSeqX it's crucial.