Is there any other way to determine which human reference genome was used to align a BAM file when this cannot be ascertained from the header? (We have come across some human BAM files whose headers do not contain enough information to determine the reference genome used for alignment.)
It would be great if you could paste one such header. I would go by the names and number of chromosomes to work out which organism's reference genome was used for the alignment. You can also pick an aligned read from the BAM file, BLAT it against several organisms, and check whether the coordinates BLAT reports match those in the BAM file. Of course, all of this only works for well-known reference genomes, not for a very specific yeast strain that was de novo assembled by a particular lab and used as the reference for some RNA-seq experiment.
PS: The OP has edited the question and is interested in which reference assembly was used for the alignment rather than which organism.
As others have suggested, run samtools view -H and post the output. A mapped BAM must have @SQ lines. If a BAM generator can produce a mapped BAM without @SQ lines, that is a bug and should be reported to its developers. Once you have the @SQ lines, the easiest way to find the reference build is to match the chromosome lengths; this works most of the time. Some BAMs also carry an MD5 checksum (the M5 tag) for each sequence, which is more reliable.
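The length-matching idea can be sketched in Python as follows. This is a minimal illustration, not a complete tool: the function names and the KNOWN_BUILDS table are made up for this example, only chr1 and chr2 are listed, and the lengths should be checked against the official chrom.sizes files before use. The input is the text printed by samtools view -H.

```python
# Illustrative table: lengths of chr1 and chr2 in three human builds.
# Assumption: verify these values against the official chrom.sizes files.
KNOWN_BUILDS = {
    "hg18/NCBI36": {"1": 247249719, "2": 242951149},
    "hg19/GRCh37": {"1": 249250621, "2": 243199373},
    "hg38/GRCh38": {"1": 248956422, "2": 242193529},
}


def parse_sq_lengths(header_text):
    """Return {sequence name without 'chr' prefix: length} from @SQ lines."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            name = fields["SN"]
            if name.startswith("chr"):
                name = name[3:]
            lengths[name] = int(fields["LN"])
    return lengths


def guess_build(header_text):
    """Return the first build whose listed lengths all match, or None."""
    observed = parse_sq_lengths(header_text)
    for build, expected in KNOWN_BUILDS.items():
        if all(observed.get(name) == length for name, length in expected.items()):
            return build
    return None
```

A header with no @SQ lines, or with lengths that match none of the listed builds, yields None, which is the case where one of the other approaches in this thread is needed.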
It's weird that there are no SN tags in the header. SN (the reference sequence name) is a required tag, so every BAM file should have it. This is exome sequencing data, because they used the SureSelect V4 capture kit (Agilent), with HGA-SURESELECTV4.unmasked.fa as the reference genome. I would select a few aligned reads from the BAM file, BLAT them against different versions of the human genome at the UCSC Genome Browser, and compare the coordinates in the BAM file with the BLAT results. This should tell you which reference assembly (hg19 or hg38) was used. The approach may not distinguish between minor revisions within an assembly, but it should distinguish hg18 from hg19 and hg19 from hg38.
Where did you get these BAM files? Are they from published results? Or collaborators? Best bet would be reading the paper or asking.
Anyway, judging by the command line in the BAM file, I would guess the reads were aligned to a FASTA obtained somehow from Agilent SureSelectV4, maybe just the probes, maybe the exons that the probes match. But it is just a wild guess, really.
The BAM files were uploaded by one of our users (it is from clinical exome sequencing). Due to the number of BAM files that are being uploaded each day, we are seeing a lot of variations in the headers. Because of this, we are focused on automating a way to determine the reference genome in cases like this (where the header doesn't provide enough info, the header may have been obscured for some reason or the header may even be missing) as opposed to asking the lab that produced the file.
So far it sounds like using BLAT may provide a way to automate the determination of the major reference genome used for the alignment so we'll be looking into this further (how to automate this using BLAT).
Well then, I would suggest requiring users to report the genome version they used to generate the BAM files, or even letting them upload the reference when it is not a common version. You can develop a script based on my answer to check for the common versions. If users tend to use many different versions, it is best to publish a guideline on which version to use. If that is not possible, users should have to report the genome version and upload the reference file in special cases. A sample submission page could help with this.
2. Compare chromosome names and sizes with those of the various versions of the human genome:
hg18 (NCBI36), hg19 (GRCh37), hg38 (GRCh38)
Some clues:
Many chromosomes differ in size between versions, so chromosome sizes help to differentiate the major releases.
The hg* line (for lack of a better term) names chromosomes with a 'chr' prefix, whereas the GRCh* line uses only the chromosome number: '1' is equivalent to 'chr1'. So chromosome naming helps to differentiate the two lines.
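The naming clue can be checked mechanically. Below is a minimal Python sketch, assuming the sequence names have already been pulled from the @SQ lines; the function name and the returned labels are hypothetical and chosen only for this illustration.

```python
def naming_style(sequence_names):
    """Classify names as 'ucsc' (chr-prefixed), 'grc' (bare), or 'unknown'."""
    primary = {str(i) for i in range(1, 23)} | {"X", "Y"}
    # Count primary chromosomes written with and without the 'chr' prefix.
    prefixed = sum(1 for n in sequence_names
                   if n.startswith("chr") and n[3:] in primary)
    bare = sum(1 for n in sequence_names if n in primary)
    if prefixed and not bare:
        return "ucsc"
    if bare and not prefixed:
        return "grc"
    return "unknown"  # empty, mixed, or non-standard names
```

Naming alone does not identify the release, since references are often renamed, so it is best combined with the length comparison.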
To calculate chromosome lengths from a FASTA file (e.g. of the human genome), install bioawk and run this:
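For readers without bioawk, the same lengths can be computed with a few lines of plain Python. This is a sketch of an alternative, not the bioawk command itself; the function name is hypothetical, and it assumes an uncompressed FASTA supplied as an iterable of lines.

```python
def fasta_lengths(lines):
    """Return {sequence name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The name is the first word of the header line.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths
```

An open file handle can be passed directly, for example fasta_lengths(open("genome.fa")), where genome.fa is a placeholder file name.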
If you know that your BAM file represents RNA-seq or exome data, you can count the number of reads overlapping exon coordinates for every relevant genome build and see which one gives the maximum percent of reads overlapping gene exons. The genome build with the highest % overlap is the correct one. You can do likewise for any other type of data that is enriched for a known set of genomic regions (e.g. promoters, miRNA, etc.).
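The overlap comparison could be sketched like this in Python. It is a simplified illustration with hypothetical function names. It assumes the read coordinates and the exon coordinates for each build are already loaded as (chrom, start, end) tuples with half-open intervals, and that chromosome names use the same convention in both.

```python
from bisect import bisect_right


def build_index(exons):
    """Group exon intervals by chromosome, sorted and merged."""
    by_chrom = {}
    for chrom, start, end in exons:
        by_chrom.setdefault(chrom, []).append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = ([s for s, _ in merged], merged)
    return index


def overlaps(index, chrom, start, end):
    """True if half-open [start, end) overlaps any merged exon on chrom."""
    if chrom not in index:
        return False
    starts, merged = index[chrom]
    # Last merged exon that starts before the read ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and merged[i][1] > start


def best_build(reads, exons_by_build):
    """Return (best build, {build: fraction of reads overlapping exons})."""
    fractions = {}
    for build, exons in exons_by_build.items():
        index = build_index(exons)
        hits = sum(overlaps(index, c, s, e) for c, s, e in reads)
        fractions[build] = hits / len(reads) if reads else 0.0
    return max(fractions, key=fractions.get), fractions
```

A random sample of a few thousand mapped reads is usually enough for this kind of comparison. If the fractions for two builds are close, the result should be treated as inconclusive.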
Another option is to simply extract the reads and re-align them yourself, and either compare to see which genome build yields the most similar alignments, or just forget about the original alignments and use your new ones instead.
Had the same problem. For me using samtools view -H bamfile.bam gave the header of the BAM file, and the header also contained the exact bowtie command that was used to create the BAM file -
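When a header does record the aligner command, it sits in the CL field of an @PG line, and it can be extracted automatically. Below is a minimal Python sketch that takes the text printed by samtools view -H; the function name is hypothetical.

```python
def program_command_lines(header_text):
    """Return [(program ID, command line)] for @PG lines with a CL field."""
    commands = []
    for line in header_text.splitlines():
        if not line.startswith("@PG"):
            continue
        # Split each TAG:value field once, so colons inside values survive.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "CL" in fields:
            commands.append((fields.get("ID", ""), fields["CL"]))
    return commands
```

The reference or index path in the command line is often the most direct clue to the build, although CL is optional and many files do not include it.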
Thank you for your reply. It is a human genome (this is known and to clarify I just added the info to the original question).
Here is the header info:
For a mapped BAM, @SQ info can't be missing, or the file is corrupted.