Should the LB field in the SAM specification refer to the library preparation for the sample, or to the library preparation carried out by the sequencing centre? Say I have a sample sequenced on multiple lanes of a single flowcell/machine: should the reads from all lanes have the same library name? Or what if I have a sample that was sequenced on one lane/flowcell/machine on a certain date, and then sequenced again on a different lane/flowcell/machine? Would the reads from these two runs have the same library name?
My question arises because, when I want to remove duplicates from multiplexed samples (all sequenced on the same machine/date), I normally just align the FASTQ files separately, then merge the BAM files belonging to the same sample and run MarkDuplicates on the merged BAM. However, I recently contacted the GATK team to ask whether read group information is necessary in this context, and the answer was yes (http://gatkforums.broadinstitute.org/gatk/discussion/9310/read-group-information-required-for-markduplicates).
This confused me: if the sample was produced from a single library, shouldn't merging followed by duplicate removal based on the 5' position alone remove all duplicates (both optical and library)?
I have faced a similar problem in the past. As far as I know, MarkDuplicates looks for duplicates among reads that belong to the same read group (RG), possibly checking the library (LB) tag of the RG. All the data from the same library should carry the same LB value in the RG. However, when you analyse your data in pieces, you may find that the RG field in the final merged file does not reflect the correct information. There are different ways to solve this. For example, if you are aligning with bwa, you can ask it to include a proper, pre-specified RG line in its output (the -R option of bwa mem).
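To make the bwa route a bit more concrete, here is a minimal Python sketch that builds one RG line per lane and prints the matching bwa mem commands. Everything specific in it is a made-up placeholder (the sample name sample1, the library name lib1, the flowcell/lane units, ref.fa and the FASTQ file names). Giving each lane its own ID and PU while sharing SM and LB is one common convention and an assumption here, not something the GATK thread prescribes.

```python
import shlex


def read_group_line(rg_id, sample, library, platform="ILLUMINA", unit=None):
    """Build an @RG line in the escaped form that `bwa mem -R` accepts.

    bwa converts the literal two-character sequence backslash-t into a
    real TAB when it writes the SAM header, so no real tabs are used here.
    """
    fields = [("ID", rg_id), ("SM", sample), ("LB", library), ("PL", platform)]
    if unit is not None:
        fields.append(("PU", unit))
    return "@RG" + "".join("\\t{}:{}".format(tag, value) for tag, value in fields)


# Hypothetical example: one sample, one library, two lanes of one flowcell.
lanes = [
    {"rg_id": "sample1.L1", "unit": "FLOWCELL_A.1", "fastq": "sample1_L1.fq.gz"},
    {"rg_id": "sample1.L2", "unit": "FLOWCELL_A.2", "fastq": "sample1_L2.fq.gz"},
]

commands = []
for lane in lanes:
    # Same SM and LB for both lanes (same library prep); ID and PU differ.
    rg = read_group_line(
        lane["rg_id"], sample="sample1", library="lib1", unit=lane["unit"]
    )
    commands.append(["bwa", "mem", "-R", rg, "ref.fa", lane["fastq"]])

for cmd in commands:
    # shlex.quote keeps the backslashes intact when pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

With read groups set this way at alignment time, the merged BAM keeps the library information for each lane, so the RG field should still be correct when MarkDuplicates is run on the merged file.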