Question

How to Resolve Reference Mismatch While Extracting Gene Variants from 1000 Genomes Project Phase 3 VCF?

3

Entering edit mode

1 day ago

Rohan ▴ 40

I am attempting to extract the PI2R gene from chromosome 19 of Homo sapiens using the Phase 3 VCF file of the 1000 Genomes Project. My goal is to:

Identify variants for each sample,
Retrieve the corresponding nucleotide sequence,
Translate it into protein,
Detect missense variations and CNVs.

I followed these steps using this post BCFtools consensus as a reference:

Extracted the PI2R region:

bcftools view -Oz -r 19:46618550-46625089 ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz > pi2r.1000g.vcf.gz
tabix -p vcf pi2r.1000g.vcf.gz

Prepared the reference genome:

samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa
sed -e 's/^>chr/>/' GRCh38_full_analysis_set_plus_decoy_hla.fa > out.fa
samtools faidx out.fa

Iterated through samples and generated sequences:

for sample in `bcftools view -h pi2r.1000g.vcf.gz | grep "^#CHROM" | cut -f10-`; do
bcftools view -c1 -Oz -s $sample -o 1000g.$sample.vcf.gz pi2r.1000g.vcf.gz
tabix -p vcf 1000g.$sample.vcf.gz
samtools faidx out.fa 19:46618550-46625089 | bcftools consensus 1000g.$sample.vcf.gz -o 1000g.pi2r.$sample.fasta
done

However, I encountered the following error:

The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA

Question:

Am I using the correct reference genome for the Phase 3 VCF alignment? If not, which reference sequence should I use?
Could there be other reasons for this mismatch? How can I resolve it to ensure the consensus sequences match the VCF data?

samtools bcftools grch38 genome vcf • 180 views

ADD COMMENT • link 2 hours ago by Rohan ▴ 40

score 4 · Accepted Answer · 2025-01-07

4

Entering edit mode

19 hours ago

Jeremy Leipzig 23k

The correct reference for that VCF from 2013 is hg37/hg19

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz

see Where can I find the reference files for the 1000 Genome project VCF data?

ADD COMMENT • link 19 hours ago by Jeremy Leipzig 23k

0

Entering edit mode

Thank you for your response and clarification!

ADD REPLY • link 2 hours ago by Rohan ▴ 40