Reference File From Bam
12.2 years ago
win ▴ 990

Hi all,

Hope someone can help. I will be getting several BAM files from users, and they may not tell us which reference file to use. Is there any way to look into a BAM file and know for sure, or at least some way to infer which reference to use?

For example, I believe that 1000 Genomes uses its own reference file, whereas Illumina used UCSC FASTA files?

Any ideas?

Thanks,
A

bam • 8.2k views

Uhm... yell at those users? Apply the LART until they provide the required information?

I have a really difficult time seeing why it's necessary to accommodate lusers who don't even know what their own files contain.


What is a reference file, why do you need it for a BAM, and where did Illumina use UCSC FASTA files?


1000 Genomes uses the GRCh37 reference (still a FASTA file, though), which should have identical coordinates for autosomes as UCSC hg19. The chromosome naming conventions differ though, and there are some unplaced contigs/scaffolds that differ between the two as well, I think.

12.2 years ago
matted 7.8k

That's an unfortunate situation, but maybe unavoidable sometimes.

You will get chromosome names and lengths in the header of the BAM (samtools view -H test.bam).

Good pipelines will put in optional fields that describe each reference well, e.g. from a 1000 Genomes BAM:

@SQ     SN:1    LN:249250621    M5:1b22b98cdeb4a9304cb5d48026a85128     UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz        AS:NCBI37       SP:Human

Minimal pipelines will just have the length of each chromosome:

@SQ     SN:chr1     LN:249250621
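
To check which of these fields a given BAM actually carries, something like the following could work. It is only a sketch: the function names `read_header` and `parse_sq_lines` are made up for illustration, it assumes `samtools` is on the PATH, and it assumes the header is tab-delimited as in a real SAM/BAM header (the examples above show spaces only because of how they were pasted).

```python
import subprocess


def read_header(bam_path):
    """Return the text header of a BAM by calling `samtools view -H`."""
    result = subprocess.run(
        ["samtools", "view", "-H", bam_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_sq_lines(header_text):
    """Return one dict of tag -> value for every @SQ line in the header."""
    records = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        # Split each TAG:value on the first colon only, because values
        # such as UR (a URL) contain colons themselves.
        record = dict(field.split(":", 1) for field in fields[1:])
        record["LN"] = int(record["LN"])
        records.append(record)
    return records


# Demo on the two @SQ lines quoted above, instead of a real file.
header = (
    "@SQ\tSN:1\tLN:249250621\tM5:1b22b98cdeb4a9304cb5d48026a85128\t"
    "UR:ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/"
    "phase2_reference_assembly_sequence/hs37d5.fa.gz\tAS:NCBI37\tSP:Human\n"
    "@SQ\tSN:chr1\tLN:249250621\n"
)
records = parse_sq_lines(header)
for rec in records:
    # Optional tags (M5, UR, AS, SP) may be absent in minimal headers.
    print(rec["SN"], rec["LN"], rec.get("AS", "-"), rec.get("M5", "-"))
```

On a real file you would call `parse_sq_lines(read_header("test.bam"))`. If `UR` or `AS` is present you are basically done; if `M5` is present you can compare it against the MD5 of each sequence in your candidate references.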

In the worst case, you can infer the reference from the chromosome names (and the number of chromosomes), and the assembly version from the sizes. I think the chromosome lengths differ between assemblies, e.g. from hg17 to hg18 to hg19. If for some reason they don't, you can look at reads around sites where the references differ and see which allele is reported as matching the reference.
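
A rough sketch of that name-and-length guess is below. The lookup table and the function `guess_assembly` are my own illustration, not part of any tool. The hg19 length is the one from the headers above; the hg18 and hg38 chr1 lengths are the values I know for those assemblies, but check them against the `.fai` index of your own candidate FASTA files before relying on them.

```python
# chr1 length -> assembly. Illustrative table; extend it from the .fai
# files of the references you actually have on disk.
KNOWN_CHR1_LENGTHS = {
    247249719: "hg18 / NCBI36",
    249250621: "hg19 / GRCh37",
    248956422: "hg38 / GRCh38",
}


def guess_assembly(name_length_pairs):
    """Guess (assembly, naming style) from (SN, LN) pairs of a BAM header.

    Only chromosome 1 is inspected, which is enough to tell the listed
    human assemblies apart but says nothing about decoys or alt contigs.
    """
    lengths = dict(name_length_pairs)
    for name, style in (("chr1", "UCSC-style names"), ("1", "names without chr prefix")):
        if name in lengths:
            assembly = KNOWN_CHR1_LENGTHS.get(lengths[name], "unknown")
            return assembly, style
    return "unknown", "unknown"


print(guess_assembly([("chr1", 249250621), ("chr2", 243199373)]))
print(guess_assembly([("1", 249250621)]))
```

The naming style matters as much as the version: a BAM with `SN:1` and a FASTA with `chr1` have the same coordinates for that chromosome but most tools will still refuse to pair them.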

This is all pretty nasty though. You could also realign the reads to a reference you choose and know by handing the BAM directly to bwa (and other aligners take BAM directly by now as well).
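
If you go the realignment route, one possible pipeline is sketched below. Note this is not the "hand the BAM directly to bwa" approach mentioned above; it swaps in a different route, converting the BAM back to FASTQ with samtools and piping that into `bwa mem`. The helper `realign_command` and the file names are hypothetical, and the code only builds and prints the shell command, it does not run anything.

```python
import shlex


def realign_command(bam_path, reference_fasta, out_sam):
    """Build a shell pipeline that re-aligns the reads of a BAM with bwa mem."""
    # Group reads by name so that mates end up next to each other.
    collate = ["samtools", "collate", "-u", "-O", bam_path]
    # Convert the collated stream on stdin to FASTQ on stdout.
    to_fastq = ["samtools", "fastq", "-"]
    # -p makes bwa mem treat adjacent reads with the same name as a pair.
    align = ["bwa", "mem", "-p", reference_fasta, "-"]
    pipeline = " | ".join(shlex.join(cmd) for cmd in (collate, to_fastq, align))
    return f"{pipeline} > {shlex.quote(out_sam)}"


cmd = realign_command("test.bam", "hs37d5.fa", "realigned.sam")
print(cmd)
```

The reference has to be indexed with `bwa index` first, and you would normally sort and compress the output afterwards. Read groups and other per-read tags from the original BAM are not carried over by this minimal version.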
