I have the variant file for all chromosomes and populations from the 1000 Genomes Project:
- ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
- ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz.tbi
Additionally, I have the canonical sequence of the FFAR1 gene in a FASTA format:
>FFAR1
MDLPPQLSFGLYVAAFALGFPLNVLAIRGATAHARLRLTPSLVYALNLGCSDLLLTVSLPLKAVEALASGAWPLPASLCPVFAVAHFFPLYAGGGFLAALSAGRYLGAAFPLGYQAFRRPCYSWGVCAAIWALVLCHLGLVFGLEAPGGWLDHSNTSLGINTPVNGSPVCLEAWDPASAGPARFSLSLLLFFLPLAITAFCYVGCLRALARSGLTHRRKLRAAWVAGGALLTLLLCVGPYNASNVASFLYPNLGGSWRKLGLITGAWSVVLNPLVTGYLGRGPGLKTVCAARTQGGKSQK
I have managed to extract the FFAR1 variant region from the VCF file using the following command:
bcftools view -r 19:35347902-35353864 -o ffar1_variant.vcf ALL.wgs.phase3_shapeit2_mvncall_integrated_v5c.20130502.sites.vcf.gz
Now, I want to extract all the variant sequences for each sample (or population) from the VCF file and map the variants onto the canonical FFAR1 gene sequence. Specifically, I need to generate output similar to:
>HG02922
MDLPPQLSFGLYVAAFALGFPLNVLAIRGATAHARLRLTPSLVYALNLGCSDLLLTVSLPLKAVEALASGAWPLPASLCPVFAVAHFFPLYAGGGFLAALSAGRYLGAAFPLGYQAFRRPCYSWGVCAAIWALVLCHLGLVFGLEAPGGWLDHSNTSLGINTPVNGSPVCLEAWDPASAGPARFSLSLLLFFLPLAITAFCYVGCLRALARSGLTHRRKLRAAWVAGGALLTLLLCVGPYNASNVASFLYPNLGGSWRKLGLITGAWSVVLNPLVTGYLGRGPGLKTVCAARTQGGKSQK
Where each sample (like HG02922
) would have its FFAR1 sequence with any genetic variants based on the VCF file. I want to compare these sequences to the canonical one and identify the variations.
I am looking for a way to:
- Parse the VCF file for all samples (or populations).
- Map the variants to the canonical FFAR1 gene sequence.
- Generate an output file with the updated sequence for each sample.
Could anyone help me with how to proceed with these steps or suggest any tools or scripts to automate this task?
related: Generate peptide sequences from VCF