Is there a way to extract the exact length in bp of one sample from a vcf? I'm very new to bioinformatics and working with sequence data, and would appreciate some advice/some questions answered.
I want to check how long specific regions (candidates for primers, microsats, etc) of a sample are in terms of bp. For this, I have a vcf of ~200 samples, filtered to keep just indels, and an indexed reference genome (bacterial, haploid).
What I want to do is check how long sample A is for ex. the region 1:591173-591277 (so that I can later create an array that does this for all samples). The output should ideally be a fasta file with just the bps of 1:591173-591277 for sample A, which I know from previous work should be ~145 bp due to a 40bp insertion in this region.
So far I've tried
samtools faidx Ref.fa 1:591173-591277 | bcftools consensus -f - -s A 591173.vcf.gz > A.consensus.fa
But this returns warnings such as "The site 1:591232 overlaps with another variant, skipping... The site 1:591233 overlaps with another variant, skipping..." and returns a sequence of 99bp, which I know to be incorrect.
I've tried the same with a normalized and filtered file (bcftools norm and bcftools filter --IndelGap 5), which obviously returns an output without any warnings, but does not return the desired 145bp length (instead returns 86bp). As I'm checking for length, it makes sense to keep the indels unfiltered in my opinion. I don't understand why indels should overlap in the consensus file if I am only extracting the sequence of one sample.
Does anyone have any advice/guidance on how to approach this situation?
What do the CIGAR string(s) look like for reads in the region? Are they capturing the insertion
Once you slice the reference genome, the positions in the VCF won't match the reference. For your method to work you would need to shift the POS field in the VCF by the same amount.
But more so, your question seems a bit unclear; the VCF file already contains information on both the reference and the variant; why do you need the reference?
you probably want to be working wihta bam file for this ..