Question

Extracting Variant Sequences from 1000 Genomes VCF File and Mapping to Canonical Gene Sequence


2 days ago

Rohan ▴ 40

I have the variant file for all chromosomes and populations from the 1000 Genomes Project:

ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi

Additionally, I have the canonical protein sequence of the FFAR1 gene in FASTA format:

>FFAR1
MDLPPQLSFGLYVAAFALGFPLNVLAIRGATAHARLRLTPSLVYALNLGCSDLLLTVSLPLKAVEALASGAWPLPASLCPVFAVAHFFPLYAGGGFLAALSAGRYLGAAFPLGYQAFRRPCYSWGVCAAIWALVLCHLGLVFGLEAPGGWLDHSNTSLGINTPVNGSPVCLEAWDPASAGPARFSLSLLLFFLPLAITAFCYVGCLRALARSGLTHRRKLRAAWVAGGALLTLLLCVGPYNASNVASFLYPNLGGSWRKLGLITGAWSVVLNPLVTGYLGRGPGLKTVCAARTQGGKSQK

I have managed to extract the FFAR1 variant region from the VCF file using the following command:

bcftools view -r 19:35347902-35353864 -o ffar1_variant.vcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz

Now, I want to extract all the variant sequences for each sample (or population) from the VCF file and map the variants onto the canonical FFAR1 gene sequence. Specifically, I need to generate output similar to:

>HG02922
MDLPPQLSFGLYVAAFALGFPLNVLAIRGATAHARLRLTPSLVYALNLGCSDLLLTVSLPLKAVEALASGAWPLPASLCPVFAVAHFFPLYAGGGFLAALSAGRYLGAAFPLGYQAFRRPCYSWGVCAAIWALVLCHLGLVFGLEAPGGWLDHSNTSLGINTPVNGSPVCLEAWDPASAGPARFSLSLLLFFLPLAITAFCYVGCLRALARSGLTHRRKLRAAWVAGGALLTLLLCVGPYNASNVASFLYPNLGGSWRKLGLITGAWSVVLNPLVTGYLGRGPGLKTVCAARTQGGKSQK

Here, each sample (such as HG02922) would have its own FFAR1 sequence incorporating any variants recorded for it in the VCF file. I want to compare these sequences to the canonical one and identify the differences.

I am looking for a way to:

Parse the VCF file for all samples (or populations).
Map the variants to the canonical FFAR1 gene sequence.
Generate an output file with the updated sequence for each sample.

Could anyone help me with how to proceed with these steps or suggest any tools or scripts to automate this task?
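The mapping step above can be illustrated with a minimal Python sketch. It assumes you already have the coding DNA sequence (CDS) of the transcript, since the FASTA shown is a protein sequence and VCF variants are nucleotide changes, and that each variant position has already been converted to a 0-based offset within that CDS. The function names `translate` and `apply_snvs` and the toy sequence are illustrative only, and the sketch handles single-nucleotide substitutions, not indels.

```python
# Standard genetic code, with codons enumerated over bases in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def apply_snvs(cds, snvs):
    """Apply (0-based CDS offset, ref base, alt base) substitutions to a CDS."""
    seq = list(cds)
    for offset, ref, alt in snvs:
        # Guard against coordinate or strand mistakes before editing the base.
        if seq[offset] != ref:
            raise ValueError(
                f"REF mismatch at CDS offset {offset}: "
                f"expected {ref}, found {seq[offset]}"
            )
        seq[offset] = alt
    return "".join(seq)


# Toy CDS (not FFAR1): ATG GAT CTG CCC TAA encodes M D L P, then a stop codon.
toy_cds = "ATGGATCTGCCCTAA"
variant_cds = apply_snvs(toy_cds, [(4, "A", "G")])  # GAT -> GGT, i.e. D -> G

print(">reference")
print(translate(toy_cds))
print(">toy_sample")
print(translate(variant_cds))
```

The REF check is the useful part in practice: if the base in your CDS does not match the REF allele in the VCF, the coordinates, the strand, or the reference build are probably inconsistent.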

bcftools genome vcf wgs • 270 views

20 hours ago by Rohan ▴ 40


related: Generate peptide sequences from VCF

2 days ago by Jeremy Leipzig 23k

score 1 · Answer 1 · 2025-01-07


1 day ago

Jeremy Leipzig 23k

Before diving into this, I would definitely look closely at:

the VEP/SnpEff annotations of single-sample VCFs
Haplosaurus

If you try to rebuild this from scratch, there will be some significant challenges with:

decoding numeric GTs in multisample VCFs into the correct ALT alleles
identifying the correct ENST transcript and the proper genomic-to-transcript coordinate shifts
frameshifts
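To make the first of these challenges concrete, here is a minimal Python sketch of decoding a numeric GT into allele strings. In a VCF record, index 0 refers to REF and indices 1, 2, ... refer to the comma-separated ALT alleles in order. The function name `decode_gt` is hypothetical, and the sketch assumes a VCF that actually has per-sample genotype columns. File names containing `.sites.` usually indicate a sites-only VCF without sample columns, so it is worth checking the header line first.

```python
import re


def decode_gt(ref, alt_field, sample_field, fmt="GT"):
    """Return the allele strings one sample carries at one VCF record.

    ref          -- REF column, e.g. "C"
    alt_field    -- ALT column, comma-separated, e.g. "T,G"
    sample_field -- the sample's column, e.g. "0|2" or "1/1:30"
    fmt          -- FORMAT column, e.g. "GT" or "GT:DP"
    """
    alleles = [ref] + alt_field.split(",")
    gt_index = fmt.split(":").index("GT")
    gt = sample_field.split(":")[gt_index]
    decoded = []
    # "|" separates phased alleles, "/" unphased ones; "." is a missing call.
    for code in re.split(r"[|/]", gt):
        decoded.append(None if code == "." else alleles[int(code)])
    return decoded


print(decode_gt("C", "T,G", "0|2"))              # one REF copy, one copy of the second ALT
print(decode_gt("A", "G", "1/1:30", "GT:DP"))    # homozygous ALT, with an extra DP field
print(decode_gt("A", "G", "./."))                # missing genotype
```

With phased genotypes, the order of the alleles matters: the first allele of every record belongs to one haplotype and the second to the other, which is what lets you rebuild two full sequences per sample.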

Fortunately you are looking at just one gene.

What is the end goal here?



Thank you for the response!

I am extracting the FFAR1 gene from all samples in the 1000 Genomes Project and other genomic databases to identify subtle missense mutations or any type of single nucleotide polymorphism (SNP). My goal is to map these variations to specific populations, analyze population-specific variations in the gene, and explore their potential roles in gene functionality.
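For the population comparison, one possible starting point is the per-population allele frequencies stored in the INFO column. The sketch below assumes the super-population keys `AFR_AF`, `AMR_AF`, `EAS_AF`, `EUR_AF`, and `SAS_AF`; please confirm these against the `##INFO` lines in your VCF header. The function name `population_afs` is hypothetical, and the INFO string is an invented toy example, not a real FFAR1 record.

```python
# Assumed INFO key names; verify them against the ##INFO header lines.
POP_KEYS = ("AFR_AF", "AMR_AF", "EAS_AF", "EUR_AF", "SAS_AF")


def population_afs(info, keys=POP_KEYS):
    """Extract per-population allele frequencies from a VCF INFO string.

    Returns a dict mapping each key that is present to a list of floats,
    one per ALT allele (multi-allelic records carry comma-separated values).
    """
    # Flags without "=" (for example a bare flag name) are skipped.
    fields = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    return {
        key: [float(value) for value in fields[key].split(",")]
        for key in keys
        if key in fields
    }


# Invented INFO string for illustration only.
toy_info = "AC=5;AF=0.000998;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;VT=SNP"
afs = population_afs(toy_info)
for key, values in afs.items():
    print(key, values)

# Populations in which the first ALT allele is observed at all.
print([key for key, values in afs.items() if values[0] > 0])
```

This only gives frequencies at the super-population level; assigning variants to individual samples still requires a VCF with genotype columns plus the sample-to-population panel file.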

20 hours ago by Rohan ▴ 40