Extracting Variant Sequences from 1000 Genomes VCF File and Mapping to Canonical Gene Sequence
2 days ago
Rohan ▴ 40

I have the variant file (and its tabix index) covering all chromosomes and populations from the 1000 Genomes Project:

  1. ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
  2. ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi

Additionally, I have the canonical protein sequence of FFAR1 in FASTA format:

>FFAR1
MDLPPQLSFGLYVAAFALGFPLNVLAIRGATAHARLRLTPSLVYALNLGCSDLLLTVSLPLKAVEALASGAWPLPASLCPVFAVAHFFPLYAGGGFLAALSAGRYLGAAFPLGYQAFRRPCYSWGVCAAIWALVLCHLGLVFGLEAPGGWLDHSNTSLGINTPVNGSPVCLEAWDPASAGPARFSLSLLLFFLPLAITAFCYVGCLRALARSGLTHRRKLRAAWVAGGALLTLLLCVGPYNASNVASFLYPNLGGSWRKLGLITGAWSVVLNPLVTGYLGRGPGLKTVCAARTQGGKSQK

I have managed to extract the FFAR1 variant region from the VCF file using the following command:

bcftools view -r 19:35347902-35353864 -o ffar1_variant.vcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz

Now, I want to extract all the variant sequences for each sample (or population) from the VCF file and map the variants onto the canonical FFAR1 gene sequence. Specifically, I need to generate output similar to:

>HG02922
MDLPPQLSFGLYVAAFALGFPLNVLAIRGATAHARLRLTPSLVYALNLGCSDLLLTVSLPLKAVEALASGAWPLPASLCPVFAVAHFFPLYAGGGFLAALSAGRYLGAAFPLGYQAFRRPCYSWGVCAAIWALVLCHLGLVFGLEAPGGWLDHSNTSLGINTPVNGSPVCLEAWDPASAGPARFSLSLLLFFLPLAITAFCYVGCLRALARSGLTHRRKLRAAWVAGGALLTLLLCVGPYNASNVASFLYPNLGGSWRKLGLITGAWSVVLNPLVTGYLGRGPGLKTVCAARTQGGKSQK

Where each sample (like HG02922) would have its FFAR1 sequence with any genetic variants based on the VCF file. I want to compare these sequences to the canonical one and identify the variations.

I am looking for a way to:

  1. Parse the VCF file for all samples (or populations).
  2. Map the variants to the canonical FFAR1 gene sequence.
  3. Generate an output file with the updated sequence for each sample.

Could anyone help me with how to proceed with these steps or suggest any tools or scripts to automate this task?

bcftools genome vcf wgs • 270 views
1 day ago

Before diving into this, I would definitely look closely at:

  • the VEP/SnpEff annotations of single-sample VCFs
  • haplosaurus

If you try to rebuild this from scratch, there will be some significant challenges with:

  • decoding numeric GTs in multisample VCFs into the correct ALT alleles
  • identifying the correct ENST transcript and the proper genomic-to-transcript coordinate offsets
  • frameshifts
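
To make the first point concrete, here is a minimal sketch of GT decoding in plain Python. It assumes you are working from a multisample VCF that actually has genotype columns; the function name `decode_genotype` and the example records are made up for illustration, and a real pipeline would more likely lean on bcftools or a VCF library.

```python
def decode_genotype(ref, alt_field, gt):
    """Translate a VCF GT string such as '0|1' into the allele strings it indexes."""
    # Index 0 is REF; indices 1..n follow the comma-separated ALT column.
    alleles = [ref] + alt_field.split(",")
    phased = "|" in gt
    calls = gt.replace("|", "/").split("/")
    # '.' marks a missing call; keep it as None rather than guessing.
    decoded = [None if c == "." else alleles[int(c)] for c in calls]
    return decoded, phased


# Hypothetical records, not taken from the 1000 Genomes files.
print(decode_genotype("G", "A,T", "0|2"))  # (['G', 'T'], True)
print(decode_genotype("C", "T", "./."))    # ([None, None], False)
```

The point is that a `2` in GT means "second ALT allele", not "two copies", which is easy to get wrong at multiallelic sites.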

Fortunately you are looking at just one gene.
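
For a single gene, the mapping step can be sketched roughly as below. This is a toy under loud assumptions: the coding sequence is one contiguous forward-strand block (no introns, no reverse complement), only SNVs are applied, and `cds_start`, the 12-base sequence, and the variant are all invented for the example. Indels are rejected outright because they would shift the frame.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate a coding sequence; unknown codons become 'X'."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def apply_snvs(cds, cds_start, snvs):
    """Apply (genomic_pos, ref, alt) SNVs to a forward-strand, single-block CDS."""
    seq = list(cds)
    for pos, ref, alt in snvs:
        if len(ref) != 1 or len(alt) != 1:
            raise ValueError(f"only SNVs are handled, got {ref}>{alt} at {pos}")
        i = pos - cds_start  # genomic coordinate -> 0-based CDS offset
        if not 0 <= i < len(seq):
            continue  # variant lies outside the coding block
        if seq[i] != ref:
            raise ValueError(f"REF mismatch at {pos}: expected {seq[i]}, VCF has {ref}")
        seq[i] = alt
    return "".join(seq)


# Toy data: hypothetical 12-base CDS starting at hypothetical position 1000.
toy_cds = "ATGGATCTGCCC"
mutated = apply_snvs(toy_cds, 1000, [(1004, "A", "G")])
print(translate(toy_cds), translate(mutated))  # MDLP MGLP
```

The REF check is worth keeping in any real version: a mismatch usually means the wrong genome build, transcript, or strand.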

What is the end goal here?


Thank you for the response!

I am extracting the FFAR1 gene from all samples in the 1000 Genomes Project and other genomic databases to identify subtle missense mutations or any type of single nucleotide polymorphism (SNP). My goal is to map these variations to specific populations, analyze population-specific variations in the gene, and explore their potential roles in gene functionality.
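
For the population-level part, one possible starting point is the INFO column of the sites file. That file carries no per-sample genotype columns, but it does report per-superpopulation allele frequencies under keys such as `EAS_AF` and `AFR_AF`. A minimal sketch follows; the helper name `population_afs` and the example INFO string are invented for illustration, not copied from the real VCF.

```python
POP_KEYS = ("EAS_AF", "AMR_AF", "AFR_AF", "EUR_AF", "SAS_AF")


def population_afs(info):
    """Pull per-superpopulation allele frequencies out of a VCF INFO string."""
    # Flag-style entries without '=' are skipped.
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    # Each value is a comma-separated list, one frequency per ALT allele.
    return {k: [float(x) for x in fields[k].split(",")]
            for k in POP_KEYS if k in fields}


# Made-up INFO string in the style of a 1000 Genomes phase 3 record.
info = "AC=5;AF=0.000998;AN=5008;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0.003;EUR_AF=0;SAS_AF=0;VT=SNP"
for pop, freqs in population_afs(info).items():
    print(pop, freqs)
```

Per-sample sequences like the `>HG02922` example would instead need the genotype VCFs plus a sample-to-population table.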
