Hi,
I am currently trying to write a pipeline to analyze ATAC sequencing data using Snakemake. I have had a good time learning about the modular + reproducible philosophy of Snakemake, its rules, generalization, etcetera.
I started with paired-end FASTQ files from an Illumina NovaSeq, which means I've got four files (two lanes). My simple analysis worked, and then I started trying to turn it into a pipeline that I could re-use every time I want to perform this type of analysis. However, I keep wondering: how do I take into account all the different types of input FASTQ files that I might get? For example, data from a NextSeq is going to come in eight files instead of four, etcetera.
I've always had issues limiting the scope of my work, and I realize I might be falling into that trap here, but I am still curious: how do the pipelines that can deal with "everything" solve this?
Thanks!
You define a variable `seq_tech` to specify the sequencing technology, and based on it you define the rest of the parameters that are unique to each technology.
Note that different sequencing technologies are not the same thing as different Illumina sequencers (which is what you seem to be mostly referring to). Ultimately, every sample is going to have one file (or more than one, if it ran on multiple lanes). For Illumina sequencers, it is possible to simply `cat` those lane-specific files together to create one pair of files (R1/R2) per sample. Think of lane-specific files as technical replicates of sequencing.
If you had files from Nanopore, PacBio and Illumina, then they would indeed be from different technologies and you would need to process them differently, even using different programs to do alignments etc.
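
As a rough illustration of the lane-merging idea, here is a minimal Python sketch (Python because Snakemake workflows are written in it). It assumes the files follow the usual Illumina naming pattern `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`; the function names `group_lanes` and `merge_lanes` are made up for this example and are not part of Snakemake. It groups files by sample and read, so it does not matter whether a run produced two lanes or four, and then concatenates them byte-wise, which is what `cat` does (concatenated gzip files are still valid gzip).

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed naming scheme: <sample>_S<n>_L<lane>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>R[12])_001\.fastq\.gz$"
)


def group_lanes(filenames):
    """Map (sample, read) to that sample's lane files, sorted by lane."""
    groups = defaultdict(list)
    for name in filenames:
        match = FASTQ_NAME.match(Path(name).name)
        if match is None:
            raise ValueError(f"unexpected FASTQ name: {name}")
        key = (match["sample"], match["read"])
        groups[key].append((match["lane"], name))
    # Sorting by lane keeps R1 and R2 in the same order, so read pairs stay in sync.
    return {key: [path for _, path in sorted(lanes)] for key, lanes in groups.items()}


def merge_lanes(lane_files, merged_path):
    """Concatenate lane files byte-wise, like `cat lane1 lane2 > merged`."""
    with open(merged_path, "wb") as merged:
        for lane_file in lane_files:
            with open(lane_file, "rb") as source:
                shutil.copyfileobj(source, merged)
```

In a Snakemake workflow, something like the dictionary returned by `group_lanes` could back an input function for a merge rule, so that every downstream rule only ever sees one R1/R2 pair per sample regardless of which sequencer produced the data.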