How to extract and analyze variant sequences from gnomAD v4 exome VCF
0
0
Entering edit mode
11 weeks ago
Rohan ▴ 40

I am working on a pharmacogenomics study focusing on the GPCR gene family and analyzing variant sequences in ~700,000 individuals from the gnomAD v4 exome database. My goal is to extract all possible variant sequences for a specific gene, map these sequences to populations, and analyze the variations.

Pipeline I Used (Works for 1000 Genomes Project VCF):

# Extract gene-specific region and prepare reference
bcftools view -Oz -r chromosome_number:start-end vcf.gz > gene.1000g.vcf.gz
tabix -p vcf gene.1000g.vcf.gz
sed -e 's/^>chr/>/' GRCh38_full_analysis_set_plus_decoy_hla.fa > out.fa
samtools faidx out.fa

# Generate consensus sequences for each sample
bcftools view -h gene.1000g.vcf.gz | grep "^#CHROM" | cut -f10- | tr '\t' '\n' | while read -r sample; do
  bcftools view -c1 -Oz -s "$sample" -o "1000g.$sample.vcf.gz" gene.1000g.vcf.gz
  samtools faidx out.fa "chromosome_number:start-end" | bcftools consensus "1000g.$sample.vcf.gz" -o "1000g.gene.$sample.fasta"
done

This pipeline works well for the 1000 Genomes dataset, where individual sample information is present in the VCF headers.

Problem with gnomAD v4 Exome VCF:

  1. The gnomAD exome VCF does not contain individual sample information but rather aggregated variant sites.
  2. When I attempt to generate consensus sequences using the following approach:
bcftools view -Oz -r chr1:23874535-23875617 gnomad.exomes.v4.1.sites.chr1.vcf.bgz > gene.gnomad.vcf.gz
tabix -p vcf gene.gnomad.vcf.gz
samtools faidx GRCh38_full_analysis_set_plus_decoy_hla.fa chr1:23874535-23875617 > reference.fa
bcftools consensus gene.gnomad.vcf.gz -f reference.fa -o gene.gnomad.fasta

I receive errors such as:

Note: applying REF,ALT variants  
The site chr1:23874536 overlaps with another variant, skipping...  
The site chr1:23874538 overlaps with another variant, skipping...

Questions:

  1. How can I effectively extract all possible variant sequences for a GPCR gene from the gnomAD v4 exome database, given that individual-level data is not available?
  2. What is the best strategy to generate consensus sequences representing all possible variant combinations for downstream analysis?
  3. How can I map these sequences to their respective populations or frequency distributions available in gnomAD metadata?

Any advice or suggestions for improving my pipeline or handling this type of genomic data would be greatly appreciated.

Thank you!

samtools bcftools exome vcf genomics • 451 views
ADD COMMENT
1
Entering edit mode

Perhaps, gnomAD discussion page is better place to get answer to this question. https://discuss.gnomad.broadinstitute.org/

ADD REPLY
0
Entering edit mode

Thanks, I'll discuss it over there.

ADD REPLY

Login before adding your answer.

Traffic: 1166 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6