I am doing some small RNA analysis to identify a number of small RNAs in cancer data. The data was multiplexed and I have been given the demultiplexed reads: 27 FastQ files for 9 samples (6 tumour and 3 normal) run on 3 different flow cells and lanes, where each sample has the same index for the 3 lanes. I am unsure whether to merge the SAM files after alignment and, if so, how to merge them: by patient or by condition? If a patient has provided samples for both conditions, how should I proceed?
run on 3 lanes where each sample has the same index for the 3 lanes.
That is technical sequence replication. You can merge those lane-specific files for each sample at any step (before or after alignment). It is also possible to generate files that are not split by lane when a large pool runs on multiple lanes, but that was not done in your case.
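As a rough illustration of merging lane-specific files before alignment, here is a minimal Python sketch. It assumes the files follow the usual bcl2fastq naming pattern (`<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`) and that every file passed in comes from the same flowcell; the function names and the output naming are made up for this example.

```python
import re
import shutil
from collections import defaultdict
from pathlib import Path

# Assumed bcl2fastq-style name: <sample>_S<n>_L<lane>_<read>_001.fastq.gz
LANE_FILE = re.compile(
    r"^(?P<sample>.+)_S\d+_L(?P<lane>\d{3})_(?P<read>[RI]\d)_001\.fastq\.gz$"
)


def group_by_sample(filenames):
    """Map (sample, read) to the sorted list of lane files for that sample."""
    groups = defaultdict(list)
    for name in filenames:
        match = LANE_FILE.match(Path(name).name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        groups[(match["sample"], match["read"])].append(name)
    return {key: sorted(names) for key, names in groups.items()}


def merge_lanes(filenames, out_dir):
    """Concatenate the lane files of each sample into one gzipped FastQ.

    Gzip files can be joined byte-for-byte, so nothing is decompressed here.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    merged = {}
    for (sample, read), names in group_by_sample(filenames).items():
        # Hypothetical output name, chosen only for this sketch.
        target = out_dir / f"{sample}_{read}.merged.fastq.gz"
        with open(target, "wb") as out:
            for name in names:
                with open(name, "rb") as src:
                    shutil.copyfileobj(src, out)
        merged[(sample, read)] = target
    return merged
```

Calling `merge_lanes(list_of_fastq_paths, "merged")` would write one file per sample and read direction. If the file names differ from the assumed pattern, the regular expression needs adjusting.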
Thank you for the reply. I should have clarified earlier, but the data I was given is a subset of a larger dataset where there are 19 pools with 45 samples within run on multiple flowcells and lanes. The data I was given was 27 FastQ files from within one pool. Is it still appropriate for me to merge specific lanes for each sample?
is a subset of a larger dataset where there are 19 pools with 45
samples within run on multiple flowcells and lanes.
There is not enough information there to comment. If a sample library with one index was run in different combinations, then it would still be technical sequence replication for that library. If there are multiple libraries for the same sample with different indexes, then it is a library prep replicate.
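To make that distinction concrete, here is a small Python sketch that applies the rule to a sample sheet. It assumes that one index corresponds to one library (as in the explanation above) and that the sheet can be reduced to `(sample, index, flowcell, lane)` tuples; that layout and the function name are hypothetical.

```python
from collections import defaultdict


def classify_replicates(runs):
    """Summarise replicate types from (sample, index, flowcell, lane) tuples.

    Assumption: one index identifies one library of a sample.
    """
    indexes = defaultdict(set)  # sample -> indexes seen for it
    units = defaultdict(set)    # (sample, index) -> (flowcell, lane) pairs
    for sample, index, flowcell, lane in runs:
        indexes[sample].add(index)
        units[(sample, index)].add((flowcell, lane))

    report = {}
    for sample, seen in indexes.items():
        report[sample] = {
            # More than one index for a sample: independently made libraries.
            "library_prep_replicates": len(seen) > 1,
            # Same library on more than one lane/flowcell: technical replication.
            "technical_replicates": {
                index: len(units[(sample, index)]) > 1 for index in sorted(seen)
            },
        }
    return report
```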
The data I was given was 27 FastQ files from within one pool.
As I said before, if a large pool ran on multiple lanes of a flowcell, then you are going to get lane-specific files for each sample (unless the --no-lane-splitting option is used for bcl2fastq). So for that particular pool, as long as it ran on one flowcell, it should be OK to merge the lane-specific files for that one flowcell.
OK, just to double check: no merging across different flowcells?
Also, if the sample libraries with different indexes are in different pools are these still considered library prep replicates?
That may depend on whether it is the same library being run across many flowcells (and different pools) and on the ultimate aim of your analysis. If you are simply going for great depth (and don't care about a potential batch effect), then you could merge across runs/flowcells. You could also use read groups to keep tabs on runs, if you choose to merge.
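For the read-group suggestion, a minimal Python sketch of one way to build an `@RG` header line per lane file is shown here. It assumes Illumina-style FastQ headers (`@instrument:run:flowcell:lane:tile:x:y ...`), and the ID/PU naming scheme is just one possible convention, not something prescribed above; the function names are made up for this example.

```python
def parse_illumina_header(header):
    """Return (flowcell, lane) from an Illumina-style FastQ header line.

    Assumed layout: @instrument:run:flowcell:lane:tile:x:y [comment]
    """
    fields = header.lstrip("@").split(" ")[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"not an Illumina-style header: {header}")
    flowcell, lane = fields[2], fields[3]
    return flowcell, lane


def read_group_line(sample, library, header):
    """Build a SAM @RG line that records which flowcell and lane reads came from."""
    flowcell, lane = parse_illumina_header(header)
    tags = [
        ("ID", f"{sample}.{flowcell}.{lane}"),  # unique per sample/flowcell/lane
        ("SM", sample),                          # sample name
        ("LB", library),                         # library name (supplied by the caller)
        ("PU", f"{flowcell}.{lane}"),            # platform unit
        ("PL", "ILLUMINA"),
    ]
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in tags)
```

If each lane file is aligned with its own read group (for example via `bwa mem -R`), the run and lane of every read can still be traced after the alignments are merged, so a batch effect can be checked later.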
That would mean that the libraries were independently made (starting with the same sample), so yes. This is sometimes done if there is a question about which index combination(s) work well for library prep/sequencing.
That's great, thank you!