How to Resolve Reference Mismatch While Extracting Gene Variants from 1000 Genomes Project Phase 3 VCF?
1
3
Entering edit mode
1 day ago
Rohan ▴ 40

I am attempting to extract the PI2R gene from chromosome 19 of Homo sapiens using the Phase 3 VCF file of the 1000 Genomes Project. My goal is to:

  1. Identify variants for each sample,
  2. Retrieve the corresponding nucleotide sequence,
  3. Translate it into protein,
  4. Detect missense variations and CNVs.

I followed these steps using this post BCFtools consensus as a reference:

  1. Extracted the PI2R region:

    bcftools view -Oz -r 19:46618550-46625089 ALL.chr19.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz > pi2r.1000g.vcf.gz
    tabix -p vcf pi2r.1000g.vcf.gz
    
  2. Prepared the reference genome:

    samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa
    sed -e 's/^>chr/>/' GRCh38_full_analysis_set_plus_decoy_hla.fa > out.fa
    samtools faidx out.fa
    
  3. Iterated through samples and generated sequences:

    for sample in `bcftools view -h pi2r.1000g.vcf.gz | grep "^#CHROM" | cut -f10-`; do
    bcftools view -c1 -Oz -s $sample -o 1000g.$sample.vcf.gz pi2r.1000g.vcf.gz
    tabix -p vcf 1000g.$sample.vcf.gz
    samtools faidx out.fa 19:46618550-46625089 | bcftools consensus 1000g.$sample.vcf.gz -o 1000g.pi2r.$sample.fasta
    done
    

However, I encountered the following error:

The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA
The fasta sequence does not match the REF allele at 19:46618630:
   REF .vcf: [A]
   ALT .vcf: [G]
   REF .fa : [T]CGCAGTAAGATACACATAACATAAAAGTTGCTATTTTAA

Question:

  1. Am I using the correct reference genome for the Phase 3 VCF alignment? If not, which reference sequence should I use?
  2. Could there be other reasons for this mismatch? How can I resolve it to ensure the consensus sequences match the VCF data?
samtools bcftools grch38 genome vcf • 180 views
ADD COMMENT
4
0
Entering edit mode

Thank you for your response and clarification!

ADD REPLY

Login before adding your answer.

Traffic: 2932 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6