Good morning, I’m new in bioinformatics and I need some help with a task . I’m trying to create a consensus file fasta from a sorted.marked.bam with bcftools. The command I’m using is:
call variants
bcftools mpileup -B -Ou -f reference.fa alignments.bam | bcftools call -mv -M -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
normalize indels
bcftools norm -f reference.fa calls.vcf.gz -Ob -o calls.norm.bcf
filter adjacent indels within 5bp
bcftools filter --IndelGap 5 calls.norm.bcf - Ob -o calls.norm.flt-indels.bcf
apply variants to create consensus sequence
cat reference.fa | bcftools consensus -H I -M calls.vcf.gz > consensus.fa
My doubt is how can I save information about Heterozygous positions?
If in a specific position there are multiple ALTs with different frequencies which one will be insert in the consensus sequence?
What I mean is that if I have a situation like this
position:100 REF:C ALT:C,G count(C):39 count(G):1
I can easily say that the correct nucleotide is C and that will be the nucleotide I see in the consensus.
While if I have a situation like this:
position:100 REF:C ALT:C,G count(C):7 count(G):7
It’s probably an Heterozygous and in the consensus I should find the S according to the IUPAC annotation for C or G bases.
But what if I have something like this:
position:100 REF:C ALT:C,G count(C):9 count(G):5
How does bcftools Heterozygous position from sequencing errors? Does it come automatically or should I insert some parameters ?
Thank you for your help.