Question

Are Illumina FASTQ files typically split by lane?

8 months ago

npb27 • 0

I have inherited an old dataset, but the details of exactly which file corresponds to what have been somewhat lost in the mists of time.

I've managed to decipher most of the details, but there's one niggling issue I wanted to check.

Each folder I've received corresponds to a single sample, and each contains two FASTQ files with identical file names except for a "1" and a "2". My initial thought was paired-end data, but there's no reason to have used PE in this context, the person who collected the data doesn't think it is, and the read headers in both files are marked as read "1" of the pair, so I don't think that's it.

Here are example headers from the fastq:

file 1:

@D00261:443:CBE42ANXX:1:1104:1199:2080 1:N:0:ACAGTG

file 2:

@D00261:443:CBE42ANXX:2:2201:1155:2032 1:N:0:ACAGTG

So if my interpretation is correct, it seems they were generated in the same sequencing run, in different lanes. Is this typical? I haven't come across it before. Do/did Illumina machines spit out fastq files split by lane?

I'm hoping I can assume that because the run and index are the same, the files are definitely from the same sample

Any help is very much appreciated

score 3 · Answer 1 · 2024-08-22

This very much depends on the person who created the FASTQ files. Some people prefer to split them by lane, others don't. In your example, the two files do indeed come from the same flowcell but from different lanes (lane 1 and lane 2). More details about Illumina's FASTQ header can be found here: https://help.basespace.illumina.com/files-used-by-basespace/fastq-files
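To make the header fields concrete, here is a minimal Python sketch that splits the two headers from the question into their parts. It assumes the usual layout `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`; the names `IlluminaHeader` and `parse_header` are made up for this illustration and do not come from any Illumina tool.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    """Fields of one FASTQ header line (hypothetical helper type)."""
    instrument: str
    run_number: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read_number: int
    is_filtered: str
    control_number: str
    index: str


def parse_header(line: str) -> IlluminaHeader:
    """Split a header of the assumed form
    @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
    """
    name, _, comment = line.strip().lstrip("@").partition(" ")
    # Only the first seven colon-separated fields of the read name are used.
    instrument, run, flowcell, lane, tile, x, y = name.split(":")[:7]
    read, filtered, control, index = comment.split(":", 3)
    return IlluminaHeader(instrument, run, flowcell, int(lane), int(tile),
                          int(x), int(y), int(read), filtered, control, index)


# The two example headers from the question.
for header in (
    "@D00261:443:CBE42ANXX:1:1104:1199:2080 1:N:0:ACAGTG",
    "@D00261:443:CBE42ANXX:2:2201:1155:2032 1:N:0:ACAGTG",
):
    h = parse_header(header)
    print(h.flowcell, h.lane, h.read_number, h.index)
# prints:
# CBE42ANXX 1 1 ACAGTG
# CBE42ANXX 2 1 ACAGTG
```

Both headers share the instrument, run, flowcell and index, both are read 1, and only the lane differs, which is consistent with single-end data split by lane rather than a read 1 / read 2 pair.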

score 2 · Answer 2 · 2024-08-22

If the data are processed locally (i.e. not in BaseSpace), then by default the Illumina FASTQ files are split by lane unless an explicit option is added to the SampleSheet file used with bcl-convert. Creating a single file per sample (single-end or paired-end) also requires a slightly different sample sheet format (with only one set of samples from the pool instead of one set for each lane).

score 2 · Answer 3 · 2024-08-22

In the Illumina naming convention, the number after the flowcell ID is the lane. And yes, it is up to the person making the FASTQs whether the lanes are concatenated into one file or kept split. Unless there is a glaring technical problem, like a bubble, lanes do not introduce technical artifacts, so you can combine them without any problem. You can just cat the files together, even when they are gzipped.
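If you would rather do the merge from a script than with `cat`, the following is a rough Python equivalent. It appends the compressed files byte for byte, which works because a gzip file may consist of several compressed members back to back. The function name `concatenate_lanes` and the file names in the usage comment are placeholders, not names from the dataset.

```python
import shutil


def concatenate_lanes(lane_files, merged_path):
    """Append each per-lane file to merged_path byte for byte,
    the same way `cat lane1.fastq.gz lane2.fastq.gz > merged.fastq.gz` would.
    """
    with open(merged_path, "wb") as out:
        for path in lane_files:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage (file names are placeholders):
# concatenate_lanes(["sample_1.fastq.gz", "sample_2.fastq.gz"],
#                   "sample_merged.fastq.gz")
```

The lane of each read is still recorded in its header after merging, so nothing is lost by combining the files.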