Good evening,
My question is whether it is more appropriate to feed Salmon a single concatenated fastq file or multiple sequencer- and read-length-specific fastq files when the reads in the fastq file (or files) have been generated at different times with different sequencing lengths for a given sample. The Salmon documentation is clear enough regarding Salmon's ability to accommodate concatenated fastq files from a single library, but I'm concerned about the effect of varying read lengths on the quantification process.
My motivation for this question is that I have a dataset generated over several years wherein certain samples with insufficient read depth were sequenced multiple times, and the different sets of reads were concatenated into single, sample-specific fastq files. I could load the single, concatenated fastq file for a given sample into Salmon, or I could decompose the fastq file for a given sample into multiple sequencer- and read-length-specific fastq files, and then load them separately into Salmon. (I could also decompose them and then load them together into Salmon using the referenced multiple read file approach, but I will resist that temptation.) My concern with the first (single file) approach is that Salmon would apply a quantification scheme to all reads that is only applicable to a subset of the reads. My concern with the second (multiple file) approach is the converse; multiple schemes will be applied when a single scheme would be more appropriate.
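For concreteness, the decomposition step could look roughly like the sketch below. It is only an illustration, and it rests on several assumptions: the records are plain four-line FASTQ, the reads are untrimmed (so read length identifies the run), and the headers follow the Illumina convention in which the first colon-delimited field is the instrument ID. The function name `split_fastq` and the toy records are made up for the example.

```python
import io
from collections import defaultdict


def split_fastq(handle):
    """Group four-line FASTQ records by (instrument ID, read length).

    Assumes Illumina-style headers, e.g. '@M01:1:FC1:1:1:1:1 1:N:0:1',
    where the text before the first ':' is the instrument ID.
    """
    groups = defaultdict(list)
    while True:
        header = handle.readline()
        if not header:  # end of file
            break
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        instrument = header.strip()[1:].split(":", 1)[0]
        read_length = len(seq.strip())
        groups[(instrument, read_length)].append(header + seq + plus + qual)
    return dict(groups)


# Toy input standing in for a concatenated, sample-specific FASTQ file.
demo = io.StringIO(
    "@M01:1:FC1:1:1:1:1 1:N:0:1\nACGT\n+\nIIII\n"
    "@K02:7:FC2:1:1:1:2 1:N:0:1\nACGTAC\n+\nIIIIII\n"
    "@M01:1:FC1:1:1:1:3 1:N:0:1\nTTTT\n+\nIIII\n"
)
groups = split_fastq(demo)
for (instrument, read_length), records in sorted(groups.items()):
    print(instrument, read_length, len(records))
```

Each group could then be written to its own file and passed to Salmon separately. For paired-end data the two mate files would need to be split in lockstep so that the pairing is preserved.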
If I use the first (single file) approach, I think I should at least shuffle the reads (per the read-order section of the documentation). If I use the second (multiple file) approach, should I use the same index for all files, or different indices (each built with the k value most appropriate for that file's read length)? I am using Salmon in non-alignment-based mode with a quasi-mapping-based index.
I would quantify every run separately and then run a couple of diagnostics (PCA, correlations) to check that there are no confounding effects due to sequencing machine or center. In my experience, k-mer length is not much of a factor; the mapping rate will at most change slightly (see e.g. Salmon Quantification for RNA-seq Read Pairs with Different Lengths), provided that read length >= k-mer length. I would use the same k-mer length for all files.
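As a rough illustration of the correlation check, here is a minimal sketch that compares two runs of the same sample. It assumes each run was quantified into its own `quant.sf` (tab-separated, with `Name` and `TPM` columns) and uses log2(TPM + 1) as the scale for comparison, which is a common choice rather than anything Salmon prescribes. The helper names and the toy values are made up for the example; in practice you would open the real `quant.sf` files instead of the in-memory strings.

```python
import csv
import io
import math


def read_tpm(handle):
    """Read a Salmon quant.sf-style table into {transcript name: TPM}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Name"]: float(row["TPM"]) for row in reader}


def log_pearson(a, b):
    """Pearson correlation of log2(TPM + 1) over transcripts shared by both runs.

    Raises ZeroDivisionError if either run has constant values.
    """
    names = sorted(set(a) & set(b))
    x = [math.log2(a[n] + 1) for n in names]
    y = [math.log2(b[n] + 1) for n in names]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


# Toy tables standing in for two per-run quant.sf files (values are invented).
header = "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
run_a = read_tpm(io.StringIO(
    header + "tx1\t1000\t850\t10.0\t100\n"
    "tx2\t2000\t1850\t250.0\t5000\n"
    "tx3\t1500\t1350\t0.0\t0\n"
))
run_b = read_tpm(io.StringIO(
    header + "tx1\t1000\t900\t12.0\t90\n"
    "tx2\t2000\t1900\t230.0\t4000\n"
    "tx3\t1500\t1400\t1.0\t5\n"
))
print(round(log_pearson(run_a, run_b), 3))
```

Runs of the same sample should correlate more strongly with each other than with runs of other samples. If they instead group by machine or read length, that points to a batch effect worth handling before merging.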
Thank you, ATpoint. Your comments are helpful, as is the link to the very informative discussion on read pairs with different lengths. I regret not finding it when I first searched for questions related to mine.