If not stated in the BAM header/metadata, is there a way to ascertain from a BAM file which minor build of the human reference genome was used to generate that BAM file?
I've encountered a BAM file that was aligned using a minor build of GRCh38 but I'm unable to determine which specific minor build (such as whether it was GRCh38.p2, GRCh38.p3, GRCh38.p4, GRCh38.p5 or GRCh38.p6).
I attempted to implement Istvan's suggestion above but hit a wall and could not get it to work. Any other suggestions for a way to ascertain from a BAM file which minor build of the human reference genome was used to generate that BAM file?
What data do you have in the header of your BAMs? Looking at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/ I see that the total lengths, etc. change between minor builds (panel on the right). Maybe there's a way to use that?
If we could see the
samtools view -H
of your BAM file, that might give us more hints. For example, the BAMs I produce via Picard have the MD5 for each contig there, and a whole bunch of obscure scaffold names. But the way Istvan described is really the best way, since often only the sequence is updated between patches, not the total chromosome size, etc.
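To make the header inspection concrete, here is a minimal Python sketch that pulls the contig name, length and MD5 (if present) out of the `@SQ` lines of `samtools view -H` output. The function name `parse_sq_lines` is just something I made up for illustration, and it assumes the header text has already been captured as a string.

```python
def parse_sq_lines(header_text):
    """Collect SN/LN/M5/AS/UR fields from the @SQ lines of a SAM header.

    header_text is assumed to be the text printed by `samtools view -H`.
    Returns {contig_name: {"LN": int or None, "M5": ..., "AS": ..., "UR": ...}}.
    """
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs; split on the first ':'
        # only, because values such as UR may themselves contain colons.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        name = fields.get("SN")
        if name is None:
            continue
        contigs[name] = {
            "LN": int(fields["LN"]) if "LN" in fields else None,
            "M5": fields.get("M5"),  # optional, often missing
            "AS": fields.get("AS"),  # optional assembly identifier
            "UR": fields.get("UR"),  # optional URI of the reference
        }
    return contigs
```

The contig names and lengths alone can already rule out builds whose scaffold set differs; the M5 values, when present, are the stronger evidence.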
Thanks for the info. The problem is that this isn't just one BAM file as I'm working on a universal solution regardless of what tool was used for alignment.
While the contig MD5 approach may work for Picard I'm not sure if it is applicable to other common aligners.
Ah, well, Picard doesn't do the aligning, it just adds the MD5 information afterwards (I think during the MergeBam step). But of course you also have to provide it the reference sequence. I map with 3 different mappers, and they all have this M5 field per contig in the header.
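If the M5 field is there, one way to use it is to compute the same digest for each candidate reference FASTA and see which build matches. A rough sketch follows; as far as I know the SAM spec defines M5 as the MD5 of the sequence with everything outside printable ASCII 33 to 126 stripped and lowercase converted to uppercase, which is what this assumes. The helper names `fasta_m5` and `matching_contigs` are hypothetical.

```python
import hashlib


def fasta_m5(lines):
    """Compute an M5-style digest per contig from an iterable of FASTA lines."""
    digests = {}
    name, md5 = None, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                digests[name] = md5.hexdigest()
            parts = line[1:].split()
            name = parts[0] if parts else ""  # contig name is the first word
            md5 = hashlib.md5()
        elif name is not None and line:
            # Keep only printable, non-space ASCII and uppercase it.
            cleaned = "".join(c for c in line if 33 <= ord(c) <= 126).upper()
            md5.update(cleaned.encode("ascii"))
    if name is not None:
        digests[name] = md5.hexdigest()
    return digests


def matching_contigs(header_m5, fasta_digests):
    """Return the contig names whose header M5 equals the FASTA digest."""
    return sorted(
        name for name, m5 in header_m5.items()
        if m5 is not None and fasta_digests.get(name) == m5
    )
```

You would run `fasta_m5` over an open file handle for each minor build and keep the build for which every contig in the BAM header matches. Contig naming (e.g. `chr1` vs `1`) may differ between the BAM and the FASTA, so a renaming step might be needed first.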
Anywhoo - not relevant if it's not already there. The only thing to do in this case would be to download every single minor build, align them all to one another as best you can, find "informative sites" where the minor builds differ, and compare those to the BAM. It's certainly not an easy problem to solve. Perhaps a graph genome will help you here, where deviations from the latest reference genome are tagged with the minor version.
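The "informative sites" idea could look roughly like the sketch below. It is a toy version that assumes the hard part is already done: the builds' sequences for a region have been aligned to the same coordinates and have equal length, and you have pulled one consensus base per position out of the BAM. Both function names and the input shapes are my own assumptions.

```python
def informative_sites(builds):
    """Return positions where the builds do not all share the same base.

    builds: {build_name: sequence}, sequences assumed pre-aligned and of
    equal length (that alignment step is not shown here).
    """
    lengths = {len(seq) for seq in builds.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    length = lengths.pop()
    return [
        i for i in range(length)
        if len({seq[i].upper() for seq in builds.values()}) > 1
    ]


def score_builds(builds, observed):
    """Count, per build, how many observed bases agree with that build.

    observed: {position: base}, e.g. consensus calls taken from the BAM
    at the informative sites.
    """
    scores = {name: 0 for name in builds}
    for pos, base in observed.items():
        for name, seq in builds.items():
            if seq[pos].upper() == base.upper():
                scores[name] += 1
    return scores
```

The build with the highest score is the best guess. Real variants in the sample will add noise at some sites, so a clear margin over many sites matters more than any single position.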