Is it legit for BAMs to have @RG headers that are not used in the alignments? I am trying to decide whether I can depend on the headers or whether I have to check the alignments to see which read groups are actually used.
If by "legit" you mean "passes the spec", then yes. I don't see anything in the SAM/BAM spec that requires every @RG read group declared in the header to actually appear in the body of the BAM.
I can't think of a case where this would be proper behaviour, but I can think up a scenario where it might happen. Say a sequencing center's pipeline does a bunch of per-lane alignments, and one of the lanes fails spectacularly. It gets pushed into alignment anyway, but none of the reads align. The pipeline adds the @RG name to the header automatically. Then, some grad student gets the bright idea to save space by using a perl script to filter out all of the unmapped reads in the bam. Boom - you've got a bam with @RG names in the headers, but no reads.
Is this farfetched? Probably a little, but people do ridiculous and stupid things all the time in bioinformatics.
Perhaps more plausibly - someone splits up a bam into a separate file for each readgroup, but just copies the existing header over.
Oh, and I've definitely seen old bams (>2 years old) that don't even have @RG headers, so there's that to consider too.
Despite all that, I'd code things up initially expecting sane header behavior. If you are proved wrong at some later date (or are extremely bored one afternoon), then add in the sanity checking.
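If you do end up adding that sanity check, here is a minimal sketch of what it could look like. It assumes you feed it SAM text lines (for example the output of `samtools view -h`), and the function name `read_group_usage` is just made up for illustration. It collects the @RG IDs declared in the header and the RG:Z values actually seen on alignments, so you can compare the two sets.

```python
def read_group_usage(sam_lines):
    """Return (declared, used) read-group IDs from an iterable of SAM text lines.

    declared: IDs from @RG header lines.
    used:     values of the RG:Z optional tag on alignment records.
    """
    declared, used = set(), set()
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if line.startswith("@"):
            # Header line; only @RG lines matter here.
            if fields[0] == "@RG":
                for field in fields[1:]:
                    if field.startswith("ID:"):
                        declared.add(field[3:])
            continue
        # Alignment record: optional tags start at the 12th column.
        for tag in fields[11:]:
            if tag.startswith("RG:Z:"):
                used.add(tag[5:])
                break
    return declared, used


# Tiny in-memory example (made-up records): lane2 is declared but never used.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@RG\tID:lane1\tSM:sample1",
    "@RG\tID:lane2\tSM:sample1",
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0\tRG:Z:lane1",
]
declared, used = read_group_usage(example)
print("declared but unused:", sorted(declared - used))
print("used but undeclared:", sorted(used - declared))
```

An empty `declared` set covers the old-BAM case with no @RG headers at all. Note that this reads every alignment, so on a big BAM it is a full pass over the file; that is the price of not trusting the header.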
I have a BAM that has no reads for one of the read groups, so I was trying to figure out whether to fix the code or the BAM. I guess I'm still not sure what to do, but it seems like it's valid SAM, so I should probably fix the code. Thanks.