Question

index sequence in fastq header

3 months ago

hpapoli ▴ 150

Hello,

I've just been inspecting my FASTQ files a bit and I have a question about the index sequence.

Given a fastq sequence header as follows:

@A00181:639:HNTFMDSX5:2:1101:1018:1000 1:N:0:ACACTAAG+TTATGGAT

I understand that ACACTAAG+TTATGGAT is a sequence index which differentiates samples on a flow cell. My first question is whether my understanding is right?

If so, wouldn't I expect all reads in a given sample to have exactly the same sequence for their index? This is mostly the case, except that I also see different indices here and there in the same fastq file such as ACACAAAG+ATATGGAT. Why is that the case?

Thanks so much for your help!


score 2 · Accepted Answer · 2024-09-20

My first question is whether my understanding is right?

Yes. It is actually a pair of indexes (the + separates the two indexes), so this is a dual-indexed sample. One can also have single-indexed samples, in which case there will be only one index sequence in the header.
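If you want to pull the index pair out of a header programmatically, here is a minimal Python sketch. It assumes the header looks like the one in the question, i.e. the part after the first space has four colon-separated fields and the last one holds the index or index pair. The function name `parse_index` is just an illustrative choice, not part of any library.

```python
def parse_index(header):
    """Return the index sequence(s) from a FASTQ header line as a tuple.

    Assumes the layout shown in the question: the text after the first
    space is colon-separated and its fourth field is the index (or the
    two indexes joined by '+').
    """
    comment = header.rstrip().split(" ", 1)[1]  # e.g. "1:N:0:ACACTAAG+TTATGGAT"
    index_field = comment.split(":")[3]         # e.g. "ACACTAAG+TTATGGAT"
    return tuple(index_field.split("+"))        # one element if single-indexed


header = "@A00181:639:HNTFMDSX5:2:1101:1018:1000 1:N:0:ACACTAAG+TTATGGAT"
print(parse_index(header))  # ('ACACTAAG', 'TTATGGAT')
```

A single-indexed header would simply give a one-element tuple.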

wouldn't I expect all reads in a given sample to have exactly the same sequence for their index?

For a particular sample labeled with that index pair, yes.

I also see different indices here and there in the same fastq file such as ACACAAAG+ATATGGAT.

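If the indexes differ by only one or two bases, they are treated as identical for the purpose of demultiplexing (this allows for sequencing errors), so those reads are still assigned to the same sample. In your example, each index differs from the expected one by a single base. See Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance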
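To make the Hamming distance idea concrete, here is a small Python sketch that compares the two index pairs from the question. The threshold `max_mismatches = 1` is only an assumed value for illustration; the tolerance actually used depends on how the demultiplexing was run.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


expected = ("ACACTAAG", "TTATGGAT")  # index pair seen in most reads
observed = ("ACACAAAG", "ATATGGAT")  # the odd pair from the question

# Compare the first index with the first, the second with the second.
distances = [hamming(e, o) for e, o in zip(expected, observed)]
print(distances)  # [1, 1]

max_mismatches = 1  # assumed tolerance per index, for illustration only
same_sample = all(d <= max_mismatches for d in distances)
print(same_sample)  # True
```

Each index is one base away from the expected one, so with a tolerance of one mismatch per index the read would still be assigned to the same sample.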

The only other way this is possible is if the file you have contains multiple samples and still needs to be demultiplexed (assuming this is not single-cell data). Tools such as demuxbyname.sh from the BBMap suite and deML can be used to demultiplex the data.
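One quick way to tell the two situations apart is to tally the index field across all headers. Below is a rough Python sketch, assuming standard four-line FASTQ records and the index as the last colon-separated field of each header. The function name `count_indexes` and the toy records (read names, bases and qualities) are made up for illustration.

```python
from collections import Counter


def count_indexes(lines):
    """Count index strings over the header lines of a FASTQ stream.

    Assumes four lines per record, so every fourth line starting at the
    first is a header, and the index is the text after the last ':'.
    """
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:
            counts[line.rstrip().rsplit(":", 1)[-1]] += 1
    return counts


# Toy records; the sequences and qualities are invented placeholders.
records = [
    "@A00181:639:HNTFMDSX5:2:1101:1018:1000 1:N:0:ACACTAAG+TTATGGAT", "ACGT", "+", "FFFF",
    "@A00181:639:HNTFMDSX5:2:1101:1019:1000 1:N:0:ACACTAAG+TTATGGAT", "TTGA", "+", "FFFF",
    "@A00181:639:HNTFMDSX5:2:1101:1020:1000 1:N:0:ACACAAAG+ATATGGAT", "GGCA", "+", "FFFF",
]
print(count_indexes(records).most_common())
# [('ACACTAAG+TTATGGAT', 2), ('ACACAAAG+ATATGGAT', 1)]
```

On a real file you could pass an open file handle instead of the list. If one index pair dominates and the rest are rare near-matches, that fits the sequencing-error explanation; several abundant, clearly different pairs would instead suggest the file still holds multiple samples.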