Question

Illumina paired end fastq sequence identifiers and index primers

3.4 years ago

wormball ▴ 10

Hello!

I have some paired-end Illumina fastq files. In most of them the sequence identifiers look like this:

....
@GENOTEK:000:311CE525F:3:1101:17996:1000 1:N:0:TCTTCACA+ATTACTCG
@GENOTEK:000:311CE525F:3:1101:21938:1000 1:N:0:TCTTCACA+ATTACTCG
@GENOTEK:000:311CE525F:3:1101:1208:1016 1:N:0:TCTTCACA+ATTACTCG
@GENOTEK:000:311CE525F:3:1101:3558:1016 1:N:0:TCTTCACA+ATTACTCG
....

As far as I understand, TCTTCACA+ATTACTCG is the first and second index primers, which are attached to the fragment to differentiate one end from the other.

But at least one pair of files has identifiers like this:

....
@GENOTEK:000:9589D2457:7:1101:12895:1362 1:N:0:NTTACTCG
@GENOTEK:000:9589D2457:7:1101:16011:1379 1:N:0:NTTACTCG
@GENOTEK:000:9589D2457:7:1101:17381:1432 1:N:0:NTTACTCG
....

....
@GENOTEK:000:9589D2457:7:1101:12895:1362 2:N:0:NTTACTCG
@GENOTEK:000:9589D2457:7:1101:16011:1379 2:N:0:NTTACTCG
@GENOTEK:000:9589D2457:7:1101:17381:1432 2:N:0:NTTACTCG
....

So these contain only one index primer, and moreover, it is the same in both files of the pair. Does this mean it is impossible to distinguish one end of the fragment from the other, so that these are effectively single-end reads?

Also, all the files have run number 000. Is that something to worry about?

Thanks in advance.

fastq primers identifiers Illumina • 2.1k views

updated 3.4 years ago by GenoMax 147k • written 3.4 years ago by wormball ▴ 10

score 3 · Accepted Answer · 2021-06-17

@GENOTEK:000:9589D2457:7:1101:12895:1362 1:N:0:NTTACTCG <-- This dataset uses a single index.

@GENOTEK:000:311CE525F:3:1101:3558:1016 1:N:0:TCTTCACA+ATTACTCG <-- This dataset uses two indexes.

In Illumina sequencing, index reads are never part of the actual insert sequence; they are read independently. The indexes have nothing to do with distinguishing one end of the fragment from the other. That is what the read number is for: the 1 or 2 right after the space in the identifier (1:N:0 vs 2:N:0 in your examples). If you have paired-end sequencing data, each fragment is sampled from both ends. If you have single-end sequencing data, the fragment is sampled from only one end. In both cases you can have a single index or two indexes. Indexes are simply used to label samples so that reads can be separated bioinformatically (demultiplexed) after the run.
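To make the distinction concrete, here is a minimal Python sketch that splits an identifier line into its fields, assuming the usual Illumina layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`. The names `IlluminaHeader`, `parse_header` and `are_mates` are made up for this illustration and do not come from any library; headers that were renamed or written by other tools may not parse this way.

```python
from typing import NamedTuple, Optional


class IlluminaHeader(NamedTuple):
    instrument: str
    run: str               # run number, e.g. "000" in the question
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int              # 1 or 2: which read of the pair this record is
    filtered: str          # "Y" if the read was filtered out, otherwise "N"
    control: int           # 0 when no control bits are set
    index1: str            # first (or only) index sequence
    index2: Optional[str]  # second index, None for single-indexed data


def parse_header(line: str) -> IlluminaHeader:
    """Parse one '@...' identifier line (hypothetical helper, assumed layout)."""
    name, comment = line.strip().lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    # Dual-indexed data joins the two index sequences with "+".
    index1, _, index2 = index.partition("+")
    return IlluminaHeader(instrument, run, flowcell, int(lane), int(tile),
                          int(x), int(y), int(read), filtered, int(control),
                          index1, index2 or None)


def are_mates(a: IlluminaHeader, b: IlluminaHeader) -> bool:
    """True if two records describe read 1 and read 2 of the same cluster."""
    # The first seven fields locate the cluster; only the read number differs.
    return a[:7] == b[:7] and {a.read, b.read} == {1, 2}


r1 = parse_header("@GENOTEK:000:9589D2457:7:1101:12895:1362 1:N:0:NTTACTCG")
r2 = parse_header("@GENOTEK:000:9589D2457:7:1101:12895:1362 2:N:0:NTTACTCG")
dual = parse_header("@GENOTEK:000:311CE525F:3:1101:3558:1016 1:N:0:TCTTCACA+ATTACTCG")

print(r1.read, r2.read, are_mates(r1, r2))  # read numbers and pairing check
print(r1.index1, r1.index2)                 # single index: index2 is None
print(dual.index1, dual.index2)             # dual index: both are set
```

The pairing check uses only the cluster coordinates and the read number, so it gives the same result for single-indexed and dual-indexed data.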

Also, all the files have run number 000. Is that something to worry about?

That should not be a cause for worry. My assumption is that the read names may have been changed afterwards.