Question

Specifying and removing Illumina adapters using fastp


3.5 years ago

Mehmet ▴ 820

Dear all,

Recently, I was asked to preprocess some FASTQ files produced on an Illumina instrument (I don't know which machine produced the data).

This is one record from a FASTQ file (forward read):

@A00957:111:H5MTHDSX2:3:1101:2718:1063 1:N:0:TCCGCGAA+AGGCTATA

CTGACCTCAAGTGATCTACCCACCTCGGTCTCCCAAAGTGCTGGGATTACAGGCAGGAGCCACTGCCCCTGGCCCTAATCATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCCGCGAAATCTCGTATGCCGGCGTCTGCTTGAAA

When I asked the company for the adapter sequences, they gave them to me as D710-501 TCCGCGAATATAGCCT (this is for one sample, forward and reverse).

In the header of the FASTQ file, however, the index appears as TCCGCGAA+AGGCTATA.

On the other hand, Illumina's documentation gives the following:

TruSeq DNA and RNA CD Indexes

Index 1 (i7) Adapters CTAGCGCT GTGTAGAC GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[i7]ATCTCGTATGCCGTCTTCTGCTTG

I want to remove the adapters from the FASTQ files, but I am a bit confused about how to specify the adapter sequences in the adapter file that is passed to fastp or Trimmomatic.

For example,

Is it okay to write TCCGCGAATATAGCCT in the adapter FASTA file, or should I specify the full adapter? I mean like this (replacing [i7] in the Illumina documentation with the sequences given in the header of the FASTQ file):

Read 1 adapter:

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[TCCGCGAA]ATCTCGTATGCCGTCTTCTGCTTG

Read 2 adapter:

GATCGGAAGAGCACACGTCTGAACTCCAGTCAC[AGGCTATA]ATCTCGTATGCCGTCTTCTGCTTG

adapter index illumina fastp fastq • 2.8k views

updated 3.5 years ago by GenoMax 148k • written 3.5 years ago by Mehmet ▴ 820

score 2 · Accepted Answer · 2021-08-07

There is a core sequence that is common to all Illumina adapters, and you can specify just that core sequence when looking for adapters. When a trimming program finds GATCGGAAGAGCACACGTCTGAACTCCAGTCAC (the Read 1 adapter core), it can simply remove that match and everything after it, all the way to the 3' end of the read. The same applies to the other adapter.
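To make the idea concrete, here is a minimal Python sketch of what "find the core sequence and drop everything to the 3' end" means, applied to the read from the question. The function name `trim_at_core` is made up for illustration, and the exact-match search is a simplifying assumption: real trimmers such as fastp or Trimmomatic tolerate mismatches and partial adapters at the read end, which this sketch does not.

```python
# Core of the Read 1 adapter, as quoted in the answer above.
CORE_R1 = "GATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

def trim_at_core(read: str, core: str = CORE_R1) -> str:
    """Cut the read at the first exact match of the adapter core.

    Hypothetical helper for illustration only: exact matching, no
    mismatch tolerance, no handling of partial adapters at the 3' end.
    """
    pos = read.find(core)
    # No match: return the read unchanged.
    return read if pos == -1 else read[:pos]

# The forward read posted in the question.
read = (
    "CTGACCTCAAGTGATCTACCCACCTCGGTCTCCCAAAGTGCTGGGATTACAGGCAGGAGCCACTGCCCC"
    "TGGCCCTAATCATAGATCGGAAGAGCACACGTCTGAACTCCAGTCACTCCGCGAAATCTCGTATGCCGG"
    "CGTCTGCTTGAAA"
)

trimmed = trim_at_core(read)
removed = read[len(trimmed):]

print("kept   :", trimmed)
print("removed:", removed)
```

Everything after the core, including the index and the rest of the adapter, is discarded along with it, which is why the sample-specific index does not need to be written into the adapter sequence. In fastp the same core sequence can be passed with `--adapter_sequence` (and `--adapter_sequence_r2` for Read 2).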

TCCGCGAA+AGGCTATA

Those are the index sequences. In Illumina technology they are sequenced independently, in separate index reads, and those index reads are never part of the actual R1/R2 sequence.
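As a small illustration, the two indexes can be pulled straight out of the FASTQ header, and comparing them with the string the company supplied suggests that TCCGCGAATATAGCCT is simply the i7 index followed by the reverse complement of the i5 index as it appears in the header. The helper names below are made up for this sketch, and the parsing assumes the standard Illumina header layout shown in the question.

```python
# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def parse_indexes(header: str) -> tuple[str, str]:
    """Extract (i7, i5) as written in an Illumina FASTQ header.

    Assumes the layout '@<instrument fields> <read>:<filtered>:<control>:<i7>+<i5>'.
    Hypothetical helper for illustration only.
    """
    comment = header.split(" ", 1)[1]      # e.g. '1:N:0:TCCGCGAA+AGGCTATA'
    index_field = comment.split(":")[3]    # e.g. 'TCCGCGAA+AGGCTATA'
    i7, _, i5 = index_field.partition("+")
    return i7, i5

header = "@A00957:111:H5MTHDSX2:3:1101:2718:1063 1:N:0:TCCGCGAA+AGGCTATA"
i7, i5 = parse_indexes(header)

# String supplied by the sequencing company for D710-501.
company_string = "TCCGCGAATATAGCCT"

print("i7:", i7)
print("i5 (as in header):", i5)
print("i7 + revcomp(i5) matches company string:",
      i7 + reverse_complement(i5) == company_string)
```

So the company string and the header describe the same index pair; neither is the adapter sequence to give to a trimmer.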