Question

Germline variant calling from BAM without reference

0

Entering edit mode

2.8 years ago

sbstevenlee ▴ 480

Hi all,

All the famous germline variant callers I know (e.g. GATK's HaplotypeCaller) require the identical reference FASTA file that was used to perform sequence alignment to make the BAM files. However, I'm curious to find out if there's a way to perform variant calling without reference genome.

For example, if I have a BAM file from 30x WGS sample with 150 bp PE reads mapped to GRCh37, can I theoretically perform variant calling without reference genome (e.g. using samtools mpileup) for a small selected region (~100 kb)? For the purpose of this question, let's say I'm only interested in biallelic SNVs (i.e. no indels and no multiallelic sites). I'm also willing to sacrifice specificity in variant calling if I can drop the reference requirement because I have a robust filtering scheme in place down the road.

I have a feeling that each uniquely mapped read containing a SNV in such BAM file should contain some information about the difference in DNA sequence caused by the SNV (i.e. REF vs. ALT), but I can't seem to find any information about it when I manually look at sequence reads in IGV. I did come across the Cigar string, but it seems that it's only for indels -- a sequence read with SNV had a string "151M" because the read still aligned to that SNV position.

Are there any tags in BAM I can extract from each sequence read to infer REF allele vs. ALT allele without reference genome? Or am I understanding this wrong in that once sequence reads are mapped their variant information is lost along the way?

Any help would be greatly appreciated.

vcf calling bam variant • 2.5k views

ADD COMMENT • link 2.8 years ago by sbstevenlee ▴ 480

score 1 · Accepted Answer · 2022-02-02

1

Entering edit mode

2.8 years ago

Istvan Albert 102k

If you knew the variants that you are looking for you can check whether the variant is present from the FASTQ file alone, you don't even need a bam file.

Such a method would work by constructing a sufficiently long k-mer that contains the variant, then looking at the kmers built from the FASTQ file. It works much faster than alignment-based methods.

A quick literature search shows:

https://www.nature.com/articles/s41598-017-02487-5

https://pubmed.ncbi.nlm.nih.gov/25404127/

there are many other alternatives.

ADD COMMENT • link 2.8 years ago by Istvan Albert 102k

0

Entering edit mode

Istvan, thank you very much for introducing me to an alternative way of thinking about this.

While this is useful, my situation calls for an ability to detect all variants, both known and novel, in a selected region as opposed to known variants only. Also, all of my sequence files are BAM -- i.e. they are already aligned to a reference genome. While I could still use the k-mer based approach you mentioned for BAM files, I'm just curious if we could extract per-base reference allele information directly from BAM files, since this information was used to align the reads in the first place.

Ultimately, I was thinking about a variant calling pipeline for aligned BAM files where I use a tool like samtools mpileup to get pileup data and then parse it to call variants. Again, I'm aware there are other famous callers such as GATK's HaplotypeCaller and bcftools -- this new pipeline I'm thinking about is specifically for generating an input VCF for my other Python program (I can get pipelup data with pysam.pileup).

I was thinking I wouldn't need a reference genome if we can distinguish REF vs. ALT alleles in the pileup or in the BAM -- does anyone know if this is possible (e.g. using certain BAM tags)?

ADD REPLY • link 2.8 years ago by sbstevenlee ▴ 480

0

Entering edit mode

If the BAM file has MD tags present then it is fairly simple to reconstruct the reference file from the BAM file. The MD tag is needed to figure out deletions, where the CIGAR string has 1D the MD tag will contain the base that was deleted. If there is no MD tag then you need the reference sequence to identify the deletions.

But I want to make a note that the two concepts are distinct and completely unrelated to one another.

Reconstructing the REF from a BAM file is a completely different task from finding a valid ALT. Every single alignment correctly and unambiguously describes the REF.

The challenge of variant callers is how to find the correct ALT because different ALT arrangements all indicating the same REF may be possible. The variant callers basically take a pileup and then, if ambiguity exists, attempt to rearrange the ALTs in a way to most likely be correct.

ADD REPLY • link 2.8 years ago by Istvan Albert 102k

0

Entering edit mode

Thanks very much for letting me know about the MD tag! Exactly what I was looking for. And I appreciate the heads up -- will keep in mind.

ADD REPLY • link 2.8 years ago by sbstevenlee ▴ 480