Hello,
I am currently trying to call a consensus sequence for a genome that is 1.2 Gbp in length.
I have a VCF file that has only ~7 million records, including variant and invariant sites.
When I run bcftools consensus Test.vcf.gz -f Reference.fasta -a N
;
I get an output consensus with ONLY 5% N content.
This is confusing to me because I wouldn't expect 7 million records to contain enough information to get a consensus for a genome of this size.
I have confirmed on a test file that consensus is working as expected and not including reference bases that are not covered in the VCF file.
Any idea what could be going on?
You should see @Michael's answer in this thread for caveats as to why this method should not be used : Generating consensus sequence from bam fileYou can usesamtools consensus
as recommended by @Jkbonfield (developer of samtools/bcftools) in thread above.This is a consensus generated from a VCF file, not a bam file. Does that change much?
I am going to delete my post since it is not applicable in your case. I should have checked the command more carefully. Apologies.