I am creating a pipeline that will run some analysis on fastq.gz files. As part of my pipeline I need to check whether the fastq files provided are paired-end or single-end reads. If they are paired-end reads I run fast-join on them; otherwise they get fed directly into my pipeline.
My question is: what is the best way to auto-detect paired-end fastq files?
So far, this is the method that I've been using.
Start by looking at all the file names. In my case they look like this.
run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L001_R1_001_subset.fastq
run1X1_220401_A00421_0429_AH3JCHDRX2_S1_L002_R1_001_subset.fastq
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R1_001_subset.fastq
run1X8_220722_A00421_0459_AHH3JFDRX2_S8_L001_R2_001_subset.fastq
In this example set of names, only sample 8 (the last two files) should be paired. This is indicated by the file names matching except for the R1 and R2 section.
My script compares all the file names. If there is exactly one difference between two file names AND that difference is in the R1 and R2 section, those files are designated as paired end and will be fed into fast-join. So the sample 1 files (the first two files) wouldn't be paired: there is only one difference between their names, but it is in the lane section (L001 vs. L002), not in the R section.
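To make the comparison concrete, here is a minimal sketch of one way to do it in Python. It is not my exact script. It assumes the read number always appears exactly once as an _R1_ or _R2_ token, as in the names above; the function name find_pairs and the regular expression are just for illustration.

```python
import re
from collections import defaultdict

# Assumption: the read number is an "_R1_" or "_R2_" token (as in the names above).
READ_TOKEN = re.compile(r"_R([12])_")


def find_pairs(filenames):
    """Split file names into (R1, R2) pairs and unpaired (single-end) files.

    Two files are paired when their names are identical once the
    R1/R2 token has been masked out.
    """
    groups = defaultdict(dict)  # masked name -> {"1": R1 file, "2": R2 file}
    single = []

    for name in filenames:
        reads = READ_TOKEN.findall(name)
        if len(reads) != 1:
            # No read token, or an ambiguous name: treat as single end.
            single.append(name)
            continue
        masked = READ_TOKEN.sub("_R?_", name)
        groups[masked][reads[0]] = name

    paired = []
    for reads in groups.values():
        if "1" in reads and "2" in reads:
            paired.append((reads["1"], reads["2"]))
        else:
            single.extend(reads.values())

    return sorted(paired), sorted(single)
```

Called on the four example names, this should put the two sample 8 files in the paired list and leave both sample 1 files as single end, since those differ in the lane section rather than the R section. Names that follow a different convention (for example a _1/_2 suffix) would need a different pattern.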
This method works well enough for me now, but I don't know if it would work with other fastq file name formats. I'm curious whether there is a more standard approach. Any feedback/advice would be greatly appreciated.
It is unusual to have a mixed dataset like this. Is this actually a mixed SE and PE dataset?
You're right. Most of the time the data will be either all SE or all PE. But there are some edge cases where this isn't true, for example when the DNA samples were PE and the RNA samples were SE. I just want my pipeline to be robust enough to handle any such case with little to no action required on the user's part.