I have paired FASTQ files that I need to run through a variant detection pipeline. The only information I have on the files is what I can glean from the sequence identifiers. The files are quite large (~40G per direction), and upon closer inspection, I discovered that they contain reads from several runs, for example:
@E00566:94:HLFT5CCXY:8:2224:9993:9660
@E00566:93:HM5CHCCXY:1:1101:10003:10275
I think it best to split the files by run ID, define read groups based on the Broad Institute's documentation (see below), run each subset through the steps to get to a GVCF file, then merge all the GVCF files that belong to the same sample. However, I am concerned about marking duplicates across multiple runs.
The Broad Institute seems to imply this approach in their updated read group description (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671), but this thread (C: Adding read group to bam files from multiplexed samples) implies something else.
If someone could weigh in on the appropriateness of my approach as well as my duplicates concerns I would be more than grateful.
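For reference, the splitting step described above could look roughly like this. It is only a sketch: it assumes the read names follow the usual Illumina layout (instrument:run:flowcell:lane:tile:x:y) and builds a read group ID as flowcell.lane, in the spirit of the linked read group description. The function names, and the SM and LB values ("SAMPLE1", "LIB1"), are hypothetical placeholders, since neither the sample nor the library can be read from the headers.

```python
from collections import Counter


def parse_illumina_header(header):
    """Split an Illumina read name into its leading colon-separated fields."""
    # Drop the leading '@' and anything after the first whitespace.
    name = header.lstrip("@").split()[0]
    instrument, run, flowcell, lane = name.split(":")[:4]
    return {"instrument": instrument, "run": run,
            "flowcell": flowcell, "lane": lane}


def read_group_for(header, sample="SAMPLE1", library="LIB1"):
    """Build read group fields; sample and library are placeholders (assumed)."""
    fields = parse_illumina_header(header)
    rg_id = f"{fields['flowcell']}.{fields['lane']}"
    # No sample barcode is visible in these headers, so PU stops at flowcell.lane.
    return {"ID": rg_id, "PU": rg_id, "SM": sample,
            "LB": library, "PL": "ILLUMINA"}


def count_read_groups(headers):
    """Count how many read names fall into each flowcell.lane group."""
    return Counter(read_group_for(h)["ID"] for h in headers)
```

Running count_read_groups over the header lines of one FASTQ file (every fourth line, starting with the first) would show how many flowcell/lane combinations are actually mixed together before committing to a split.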
Was the sample library/pool sequenced on multiple flowcells?
Yes. I believe this can be determined from the flowcell fields of the sequence identifiers. The reads also come from different lanes (though on its own, I don't think the lane is too important).
@E00566:94:HLFT5CCXY:8:2224:9993:9660
@E00566:93:HM5CHCCXY:1:1101:10003:10275
No. Those are just flowcell barcodes. Do you know if the same library/pool ran on both of those flowcells? If so you could consider those runs as technical replicates.
Ohhh...that I do not know, nor can I determine it or ask anyone. The only things I can glean from the files are what is listed in the sequence headers. Given that I don't have this information, would you suggest splitting as I mentioned previously?
Do you think someone merged files from multiple runs because they were technical replicates to begin with? Otherwise that sounds like a strange thing to do. It is certainly not making your life any easier.
Darn tootin'! I really don't know. Nor do I know how many hands the files passed through before they got to me. I do know that the investigator wanted very high coverage (very rare disease).
Then I would tend to think that these are technical replicates, but you can't be sure without evidence. You could separate the files and look for common SNPs to confirm.
Woof. Having said that, this is the best recommendation I have heard. Thank you @genomax.
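The check suggested above (call variants per run, then look for shared SNPs) could be sketched along these lines. This is a rough illustration only: it assumes each run has already been called separately into a plain-text VCF, compares sites rather than genotypes, and uses hypothetical function names. A high overlap would be consistent with the runs coming from the same sample, but on its own it does not prove that the same library was sequenced.

```python
def load_snps(vcf_lines):
    """Collect (chrom, pos, ref, alt) for single-base substitutions in VCF text."""
    snps = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        for allele in alt.split(","):
            # Keep only simple SNPs; indels and symbolic alleles are ignored.
            if len(ref) == 1 and len(allele) == 1 and allele in "ACGT":
                snps.add((chrom, int(pos), ref, allele))
    return snps


def shared_fraction(snps_a, snps_b):
    """Fraction of all observed SNPs that appear in both sets (Jaccard index)."""
    union = snps_a | snps_b
    return len(snps_a & snps_b) / len(union) if union else 0.0
```

With two per-run files open as text, shared_fraction(load_snps(file_a), load_snps(file_b)) gives a single number to compare against what you see between unrelated samples.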