Question

How extract data from gnomAD for some genes?

0

Entering edit mode

5.3 years ago

star ▴ 350

I like to extract variant details about some genes from gnomAD 'https://gnomad.broadinstitute.org/', but I don`t want to do it by hand. I would like to know, is there any way to extract it by command line for below genes?

Table A:

RFPL3
RGMA
RGS11
RHOT1
RIPOR1
RIPOR3
RNASE9
RNF145
RNF24
RPL17
RPL17-C18orf32
RPL19
RPL7
SAMD3
SCAMP4
SCNM1

There is a 'gsutil ls gs://gnomad-public/release/2.1.1' command in gnomad but I don`t know how can I use it to extract data for my list?

gnomAD data variants • 3.4k views

ADD COMMENT • link updated 2.9 years ago by Kalin ▴ 50 • written 5.3 years ago by star ▴ 350

2

Entering edit mode

gsutil can be used to download data. For the moment, you'll either need to download the entire VCF and do some extraction + manual processing or download CSV files gene-by-gene.

ADD REPLY • link 5.3 years ago by Ram 44k

score 1 · Answer 1 · 2022-01-28

I created a python package based on SQLite databases, where you can easily query all gnomAD variants for GRCh37/38. https://github.com/KalinNonchev/gnomAD_DB I have precomputed SQLite databases for gnomAD WGS for GRCh37/38 in the description of the package. Please take a look there. You could query the exon intervals of the genes using it.

score 0 · Answer 2 · 2019-09-16

0

Entering edit mode

5.3 years ago

noodle ▴ 600

From R you can use the packages GenomicRanges and/or VariantAnnotation and/or vcfR and follow this workflow;

#defines the region you want to look at
rngs <- GRanges(chr, IRanges(query.start,query.end))
names(rngs) <- gene.name
param <- ScanVcfParam(which=rngs) 
this.vcf.data <- readVcf(genome.vcf, "hg19", param)

#retrieves useable information from vcf
rd <- rowRanges(vcf)
basic.info.df = as.data.frame(rd)
expanded.info.df = as.data.frame(info(vcf))

ADD COMMENT • link 5.3 years ago by noodle ▴ 600

0

Entering edit mode

This is one part of the solution. It does not address obtaining the VCF file or the genomic co-ordinates of the genes. Given that it is incomplete, I'd recommend that this content be moved to a comment rather than continue being an answer.

ADD REPLY • link 5.3 years ago by Ram 44k

0

Entering edit mode

Well thank goodness for high standards...below is a bit more which addresses your concerns, although the transcript should be known along with the gene name and the vcf files should be downloaded locally via ftp or gs (or from web browser) here

This will also require RMySQL and I show here retrieving +/- 4000bp but you can change this however you like...

conn <- dbConnect(MySQL(), user = 'genome', password = '', host = 'genome-mysql.cse.ucsc.edu', dbname = 'hg19')

NM.query=c()
NM.query <- dbGetQuery(conn, paste("SELECT * FROM refGene WHERE name = '",RefSeq,"' and name2 = '",gene.name,"'",sep=""))
this.chrom = NM.query$chrom
this.strand = NM.query$strand
if(NM.query$strand == "-"){
  stop.here = NM.query$txEnd + 4000
  start.here = NM.query$txStart - 4000
}
if(NM.query$strand == "+"){
  start.here = NM.query$txStart - 4000
  stop.here = NM.query$txEnd + 4000
}

query.start = start.here
query.end = stop.here

#defines the region you want to look at
rngs <- GRanges(chr, IRanges(query.start,query.end))
names(rngs) <- gene.name
param <- ScanVcfParam(which=rngs) 
this.vcf.data <- readVcf(genome.vcf, "hg19", param)

#retrieves useable information from vcf
rd <- rowRanges(vcf)
basic.info.df = as.data.frame(rd)
expanded.info.df = as.data.frame(info(vcf))

https://gnomad.broadinstitute.org/downloads

ADD REPLY • link 5.3 years ago by noodle ▴ 600