Question

Illumina FASTQ FILTERING

2.1 years ago

Ranan Jyoti Sarma ▴ 100

Hello peers,

I need some help regarding QC of NGS Data.

I have raw NGS data (more than 100 samples) in FASTQ format, and I have trimmed the adapter sequences using Trimmomatic v0.39. The adapter sequences used for trimming were:

>PrefixPE/1
TACACTCTTTCCCTACACGACGCTCTTCCGATCT

>PrefixPE/2 
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT

I then ran FastQC to confirm that the adapter sequences had been removed. However, when I checked for the index sequences, they were still present in the reads.

For example, I have given below the description line of one read of the forward read file (Read1) from one sample:

@F00740:29:HCHTCDRXX:1:1101:10574:1016 1:N:0:TCCGGAGA+GGCTCTGA

As shown in the FASTQ description line, the 8 bp dual index TCCGGAGA+GGCTCTGA was used. These index sequences are still present in the trimmed reads. I checked this using:

grep "^[^@]" 2B_TruSeqCD_BHCHTCDRXX_R1.fastq| grep GGCTCTGA

Output:

[Image: grep output for the GGCTCTGA index]

AND

grep "^[^@]" 2B_TruSeqCD_BHCHTCDRXX_R1.fastq| grep TCCGGAGA

Output:

[Image: grep output for the TCCGGAGA index]

Since these index sequences are not of biological origin, how do I get rid of them? Or can I still go ahead with alignment? Should I trim these sequences off?

Your expert advice (with or without supporting articles/documentation) will be appreciated.

Thank you!

NGS Genomics • 959 views

Updated 2.1 years ago by Ram 44k • written 2.1 years ago by Ranan Jyoti Sarma ▴ 100

score 1 · Answer 1 · 2022-12-04

2.1 years ago

ATpoint 86k

The index reads are stored in a separate FASTQ file, which is often not even distributed by the sequencing provider because you don't need it. What you grep is just part of a normal read; any combination of 8 nucleotides can occur by chance in the genome. You're fine, proceed with alignment.
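To put a rough number on "by chance", here is a back-of-the-envelope sketch. It assumes uniformly random bases and a 150 bp read length, neither of which comes from the question, and `kmer_chance` is a hypothetical helper name. It prints the expected number of occurrences of one specific 8-mer per read on the forward strand, which for small values is close to the fraction of reads that contain it.

```shell
# Expected occurrences of one specific k-mer in a read of length len,
# assuming each base is A/C/G/T with equal probability (a simplification).
kmer_chance() {
    awk -v len="$1" -v k="$2" 'BEGIN {
        positions = len - k + 1          # number of start positions in the read
        printf "%.4f\n", positions / (4 ^ k)
    }'
}

# Assumed read length of 150 bp, index length of 8 bp.
kmer_chance 150 8    # prints 0.0022
```

Under these assumptions, roughly 0.2% of reads would contain any given 8-mer, so a few thousand hits per million reads is expected even when no index sequence is present in the reads.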


score 0 · Answer 2 · 2022-12-04

In Illumina sequencing, index reads are read independently of the main reads. The read order is Read 1 --> Index 1 (if present) --> Index 2 (if present) --> Read 2. The index reads are used during demultiplexing to bin the main reads (R1 and R2) into sample-specific files. In this process, the index sequences are transferred to the header of the binned reads (they can also be recovered as separate files, since they may be needed in that form for rare protocols).
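As an illustration, the index pair can be pulled out of the header rather than searched for in the read sequence. This is a minimal sketch that assumes the header layout shown in the question (a space, then colon-separated fields ending with the index); `extract_index` is a hypothetical helper name.

```shell
# Print the last colon-separated field of the second header token,
# which holds the index pair in the header format shown in the question.
extract_index() {
    printf '%s\n' "$1" | awk '{ n = split($2, f, ":"); print f[n] }'
}

extract_index '@F00740:29:HCHTCDRXX:1:1101:10574:1016 1:N:0:TCCGGAGA+GGCTCTGA'
# prints TCCGGAGA+GGCTCTGA
```

The same awk expression, restricted to header lines with `NR % 4 == 1` and piped through `sort | uniq -c`, would tally the index pairs in a whole file, assuming the usual four-lines-per-record FASTQ layout.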

You do not need to do anything with, or worry about, the index sequences other than using them for sample identification.