Question

Extract length of sample sequence from vcf

0

Entering edit mode

4 days ago

jahnreinhard.ringger • 0

Is there a way to extract the exact length in bp of one sample from a vcf? I'm very new to bioinformatics and working with sequence data, and would appreciate some advice/some questions answered.

I want to check how long specific regions (candidates for primers, microsats, etc) of a sample are in terms of bp. For this, I have a vcf of ~200 samples, filtered to keep just indels, and an indexed reference genome (bacterial, haploid).

What I want to do is check how long sample A is for ex. the region 1:591173-591277 (so that I can later create an array that does this for all samples). The output should ideally be a fasta file with just the bps of 1:591173-591277 for sample A, which I know from previous work should be ~145 bp due to a 40bp insertion in this region.

So far I've tried

samtools faidx Ref.fa 1:591173-591277 | bcftools consensus -f - -s A 591173.vcf.gz > A.consensus.fa

But this returns warnings such as "The site 1:591232 overlaps with another variant, skipping... The site 1:591233 overlaps with another variant, skipping..." and returns a sequence of 99bp, which I know to be incorrect.

I've tried the same with a normalized and filtered file (bcftools norm and bcftools filter --IndelGap 5), which obviously returns an output without any warnings, but does not return the desired 145bp length (instead returns 86bp). As I'm checking for length, it makes sense to keep the indels unfiltered in my opinion. I don't understand why indels should overlap in the consensus file if I am only extracting the sequence of one sample.

Does anyone have any advice/guidance on how to approach this situation?

microsatellites bcftools vcf indels variants • 505 views

ADD COMMENT • link updated 3 hours ago by Istvan Albert 102k • written 4 days ago by jahnreinhard.ringger • 0

0

Entering edit mode

What I want to do is check how long sample A is for ex. the region 1:591173-591277

What do the CIGAR string(s) look like for reads in the region? Are they capturing the insertion

ADD REPLY • link 4 days ago by GenoMax 150k

0

Entering edit mode

Once you slice the reference genome, the positions in the VCF won't match the reference. For your method to work you would need to shift the POS field in the VCF by the same amount.

But more so, your question seems a bit unclear; the VCF file already contains information on both the reference and the variant; why do you need the reference?

ADD REPLY • link 4 days ago by Istvan Albert 102k

0

Entering edit mode

you probably want to be working wihta bam file for this ..

ADD REPLY • link 4 days ago by LauferVA 4.7k

score 4 · Answer 1 · 2025-04-04

4

Entering edit mode

4 days ago

Billy Rowell ▴ 510

This is hacky and definitely not prod ready, but if the goal is to get the length and not the sequence, you could use bcftools/paste/bc to find the difference in lengths between the ALT and REF alleles of the variants in this region:

(
  bcftools query \
    -f '%STRLEN(ALT[0])\n' 591173.vcf.gz \
    -r 1:591173-591277 \
  | paste -sd'+' | bc;
  bcftools query \
    -f '%STRLEN(REF)\n' 591173.vcf.gz \
    -r 1:591173-591277 \
  | paste -sd'+' | bc
) \
| paste -sd'-' | bc

This has a couple requirements/assumptions:

this bcftools query syntax requires features that are only in the most recent unreleased version of bcftools (>v1.21), compiled from source
this requires that your VCF uses the recommended representations of simple indels where the base before the event is included.

ADD COMMENT • link 3 days ago by Billy Rowell ▴ 510

0

Entering edit mode

If you don't want to mess with compiling bleeding-edge bcftools, you could write the same thing in awk. There's no need to filter out the SNPs first, as they won't change the sum.

bcftools view -H -f "PASS" 591173.vcf.gz 1:591173-591277 \
| awk -F '\t' 'BEGIN {sum=0}; {sum+=length($5)-length($4)}; END {print sum};'

ADD REPLY • link 3 days ago by Billy Rowell ▴ 510

0

Entering edit mode

unfortunately both of these did not work for my case, as both return the difference between the REF and ALT alleles of all samples, and not between the REF and each sample, even when I just create a vcf containing one genotype.

Is there a method to retrieve the length of just one genotype with awk, instead of all genotypes?

ADD REPLY • link 8 hours ago by jahnreinhard.ringger • 0

0

Entering edit mode

I would use a tool like cyvcf2

https://brentp.github.io/cyvcf2/

It provides a simple pythonic interface to VCF data. You can then easily write programs like (this one is ChatGPT generated, though, since I am pressed for time ATM):

from cyvcf2 import VCF

vcf = VCF("591173.vcf.gz")

# Define region of interest
region = "1:591173-591277"

ref_len = 0
alt_len = 0

for variant in vcf(region):
    ref_len += len(variant.REF)
    # ALT can have multiple alleles; only take the first (ALT[0])
    if variant.ALT and variant.ALT[0]:
        alt_len += len(variant.ALT[0])

# Compute the difference: sum(ALT[0] lengths) - sum(REF lengths)
print(alt_len - ref_len)

ADD REPLY • link 3 hours ago by Istvan Albert 102k