Question

Download hundreds of genes' variant csv from gnomAD

2

6.1 years ago

Qingyang Xiao ▴ 160

Now I have 500 genes of interest that I want to download from gnomAD for SNP analysis.

It will take forever if I type each gene name and click the "Export to csv" button.

How can I do that in batches?

genome SNP • 5.6k views

ADD COMMENT • link updated 3.2 years ago by Kalin ▴ 50 • written 6.1 years ago by Qingyang Xiao ▴ 160

score 5 · Answer 1 · 2019-04-07

5

6.1 years ago

Pierre Lindenbaum 166k

wget -O - "https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz" | gunzip -c | grep -E '(^#|\|(GENE1|GENE2|GENE3|GENE4)\|)' > genes.vcf

ADD COMMENT • link 6.1 years ago by Pierre Lindenbaum 166k

3

If you are interested in specific genes, you would probably want to use gnomAD exomes, not genomes. It's based on more samples and the file is substantially smaller.

ADD REPLY • link 6.1 years ago by igor 13k

1

Small suggestion: if you have the disk space (on the order of ~1 TB), you could save the wget output to a file first (e.g. wget -O - "https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/genomes/gnomad.genomes.r2.1.1.sites.vcf.bgz" > gnomad.vcf.bgz) and then query that file with gunzip + grep afterwards, in case you want to look at different genes, or you notice a typo, etc. You could also do it per chromosome and only grep for the genes on the chromosomes you need (see the download page).

Since you have 500 genes, you could also put them in a text file (one gene per line) and pass that file to grep as the list of search patterns, changing the grep part to gunzip -c gnomad.vcf.bgz | grep -E -f mygenes.txt.

Also keep in mind that grep will match whatever text is present: if one of your gene symbols is a substring of something unrelated, that will be matched too, so you should definitely check your output for correct matches.

Finally, do you have gene symbols or gene identifiers (e.g. Ensembl or RefSeq)? I would download the smallest file first (chr21 sites VCF, 6.12 GiB), check that your inputs work with what the gnomAD VCF provides, and then try the whole dataset.

ADD REPLY • link 6.1 years ago by mbelmadani ★ 1.4k
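The substring caveat in the reply above can be handled by wrapping each symbol in the '|' delimiters that the original grep pattern already relies on. The following is only a minimal sketch: mygenes.txt, patterns.txt and demo.vcf are hypothetical names, and the two records in demo.vcf are made-up stand-ins for the decompressed gnomAD stream, not real data.

```shell
# Hypothetical gene list, one symbol per line.
printf 'BRCA1\nTP53\n' > mygenes.txt

# Drop empty lines, then wrap each symbol in '|' so that TP53 is matched
# only as a whole annotation field and not inside TP53BP1.
sed -e '/^$/d' -e 's/.*/|&|/' mygenes.txt > patterns.txt

# Tiny stand-in for the decompressed VCF (assumed layout, invented records).
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '1\t100\t.\tA\tG\t.\tPASS\tvep=G|missense_variant|MODERATE|TP53|ENSG1\n' >> demo.vcf
printf '1\t200\t.\tC\tT\t.\tPASS\tvep=T|missense_variant|MODERATE|TP53BP1|ENSG2\n' >> demo.vcf

# Keep the header lines, then append records matching any pattern as a fixed string.
grep '^#' demo.vcf > genes.vcf
grep -v '^#' demo.vcf | grep -F -f patterns.txt >> genes.vcf
```

On the real file, demo.vcf would be replaced by the output of gunzip -c gnomad.vcf.bgz. Using grep -F treats the patterns as fixed strings, so the '|' characters need no escaping.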

0

But VCF files don't have gene names/symbols, correct? Maybe you have to convert your gene name list into start:end coordinates. I have a similar task, and I'd love help on the matter.

ADD REPLY • link 4.2 years ago by jcs92 • 0
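If you do go the coordinate route, one possible sketch is to turn a gene table into region strings and hand those to tabix. The file names and coordinates below are placeholders (genes.bed is a hypothetical table you would build from an annotation source, and the positions are invented), and the tabix call is left as a comment because it assumes htslib is installed and that the VCF has a .tbi index next to it.

```shell
# Hypothetical gene table in BED layout: chrom, 0-based start, end, symbol.
# The coordinates are placeholders, not real gene positions.
printf '21\t1000\t2000\tGENE1\n21\t5000\t7500\tGENE2\n' > genes.bed

# Convert each BED row to a 1-based "chrom:start-end" region string.
awk -F '\t' '{ printf "%s:%d-%d\n", $1, $2 + 1, $3 }' genes.bed > regions.txt
cat regions.txt

# With an indexed, bgzip-compressed VCF on disk, the regions could then be
# extracted without scanning the whole file (assumes tabix is available):
#   tabix -h gnomad.vcf.bgz $(cat regions.txt) > genes.vcf
```

Chromosome names in the regions have to match the naming used in the VCF (with or without a "chr" prefix), so it is worth checking a few records first.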

0

Thanks. If I download the .csv file directly from gnomAD, the data is integrated from both gnomAD Genomes and Exomes, but the output of the code above only contains data from Genomes. Could I get the data integrated from Genomes and Exomes, just as when I click to download?

ADD REPLY • link 5.3 years ago by Qingyang Xiao ▴ 160

0

I don't think there's a single file with both (officially at least) but the exome variants are at https://storage.googleapis.com/gnomad-public/release/2.1.1/vcf/exomes/gnomad.exomes.r2.1.1.sites.vcf.bgz (link from the gnomAD download page: https://gnomad.broadinstitute.org/downloads).

ADD REPLY • link 5.3 years ago by mbelmadani ★ 1.4k
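Since the exomes and genomes come as separate files, one way to get both into a single table is to filter each file separately and then stack the results with a column naming the source. This is only a sketch under assumptions: genes.exomes.vcf and genes.genomes.vcf are hypothetical names for the two filtered outputs, the records are invented, and the result is a plain concatenation, not a per-variant merge of allele counts.

```shell
# Stand-ins for the filtered exome and genome outputs (made-up records).
printf '##fileformat=VCFv4.2\n21\t100\t.\tA\tG\t.\tPASS\tAC=3\n' > genes.exomes.vcf
printf '##fileformat=VCFv4.2\n21\t100\t.\tA\tG\t.\tPASS\tAC=1\n' > genes.genomes.vcf
printf '21\t250\t.\tC\tT\t.\tPASS\tAC=2\n' >> genes.genomes.vcf

# One table with a leading column saying which dataset each record came from.
printf 'source\tchrom\tpos\tref\talt\tinfo\n' > combined.tsv
for src in exomes genomes; do
  # Skip header lines; keep CHROM, POS, REF, ALT and INFO from each record.
  awk -F '\t' -v OFS='\t' -v src="$src" \
    '!/^#/ { print src, $1, $2, $4, $5, $8 }' "genes.$src.vcf" >> combined.tsv
done
```

A variant present in both datasets (21:100 A>G in this toy input) appears twice, once per source, so any combined counts would still have to be computed downstream.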

score 0 · Answer 2 · 2022-01-28

0

3.2 years ago

Kalin ▴ 50

I created a Python package based on SQLite databases that lets you easily query all gnomAD variants for GRCh37/38 (https://github.com/KalinNonchev/gnomAD_DB). Precomputed SQLite databases for gnomAD WGS (GRCh37/38) are linked in the package description. Please take a look there.

ADD COMMENT • link 3.2 years ago by Kalin ▴ 50