Hope someone can help. I will be getting several BAM files from users, and they may not tell us which reference file to use. Is there any way to look into a BAM file and know for sure, or at least infer, which reference file to use?
For example, I believe that 1000 Genomes uses their own bam file whereas Illumina used UCSC fasta files?
1000 Genomes uses the GRCh37 reference (still a FASTA file, though), which should have identical coordinates for autosomes as UCSC hg19. The chromosome naming conventions differ though, and there are some unplaced contigs/scaffolds that differ between the two as well, I think.
Look at the BAM header. Minimal pipelines will just record the name and length of each chromosome in the @SQ lines:
@SQ SN:chr1 LN:249250621
In the worst case, you can infer the reference from the chromosome names (and number of chromosomes) and the assembly version by the sizes. I think they differ by a few bases e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around inter-reference variant sites and see which allele is called as matching the reference.
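To make the header-based guess concrete, here is a rough sketch in Python. It assumes you have already dumped the header to text (for example with `samtools view -H file.bam`), and it only checks chr1 against a small, hand-picked table of lengths. The function names and the table are mine for illustration, not part of any tool, and the table is nowhere near exhaustive.

```python
# Illustrative chr1 lengths for a few human assemblies (not exhaustive;
# extend this table with whatever assemblies you expect to receive).
CHR1_LENGTHS = {
    247249719: "hg18 / NCBI36",
    249250621: "hg19 / GRCh37",
    248956422: "hg38 / GRCh38",
}


def parse_sq_lines(header_text):
    """Return {sequence name: length} from the @SQ lines of a SAM header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields are TAG:VALUE pairs; splitting on any whitespace also copes
        # with headers that were pasted with spaces instead of tabs.
        fields = dict(f.split(":", 1) for f in line.split()[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            lengths[fields["SN"]] = int(fields["LN"])
    return lengths


def guess_reference(header_text):
    """Guess (naming convention, assembly) from chr1's name and length."""
    lengths = parse_sq_lines(header_text)
    if "chr1" in lengths:
        naming, chr1_len = "UCSC-style (chr1)", lengths["chr1"]
    elif "1" in lengths:
        naming, chr1_len = "GRCh/Ensembl-style (1)", lengths["1"]
    else:
        return None, None
    return naming, CHR1_LENGTHS.get(chr1_len, "unknown")


if __name__ == "__main__":
    print(guess_reference("@SQ\tSN:chr1\tLN:249250621"))
```

A length match only narrows things down. hg19 and GRCh37 share chr1's length, so the naming convention and the set of extra contigs are what separate them.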
This is all pretty nasty though. You could also realign the reads to a reference you choose and know by handing the BAM directly to bwa (and other aligners take BAM directly by now as well).
Uhm... yell at those users? Apply the LART until they provide the required information?
I have a really difficult time seeing why it's necessary to accommodate lusers who don't even know what their own files contain.
What is a reference file, why do you need it for a BAM file, and where did Illumina use UCSC fasta files?