I have a list of genes such as BRCA1, BRCA2, PTEN etc and I want a list of SNPs found in each gene. How could I do that programmatically?
Thanks in advance.
I have a list of genes such as BRCA1, BRCA2, PTEN etc and I want a list of SNPs found in each gene. How could I do that programmatically?
Thanks in advance.
If you first need a BED file of genes (the genomic locations of your genes), then let's say you have a file called genes.txt
that contains your gene names of interest:
$ cat genes.txt BRCA1 BRCA2 PTEN $
You can pull GENCODE records (via wget
), filter them for genes (via awk
), filter again for each of your gene names (via grep -f
), and convert them to sorted BED (via convert2bed
):
$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \ | gunzip --stdout - \ | awk '$3=="gene"' - \ | grep -f genes.txt - \ | convert2bed -i gff - \ > genes.bed
No sort-bed
step is required here; the default output of convert2bed
is sorted (assuming BEDOPS is installed).
Once you have genes.bed
, you can use it as the input for the other answer I provided in this post.
Programmatically, here's a demonstration of how to do this for 1000Genome SNPs on chrY
.
First, start with a sorted set of genes in BED format:
$ sort-bed unsorted_genes.bed > genes.bed
(If you don't have locations of your genes, just edit your question to describe what you have, which we can probably work into a BED file.)
Then map 1000Genome SNPs to those genes:
$ wget -qO- ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/release/20130502/ALL.chrY.phase3_integrated.20130502.genotypes.vcf.gz \
| gunzip --stdout - \
| convert2bed -i vcf - \
| awk '{print "chr"$0;}' - \
| bedmap --echo --echo-map-id-uniq --chrom chrY genes.bed - \
> answer.chrY.bed
This uses convert2bed to convert VCF to sorted BED, and BEDOPS bedmap
--echo-map-id-uniq
to map a unique list of SNP IDs to each gene in genes.bed
.
We add --chrom chrY
to bedmap
in order focus file operations on the part of the file we're interested in, which improves performance.
The awk
statement prepends each line of the BED output with the string "chr", to reformat the chromosome name to UCSC standards, which is the naming scheme used in GENCODE records pulled in the second answer in this post. We want to make sure the chromosome names line up when mapping genes to SNPs.
This is just for chrY
. You could probably parallelize this for the other human chromosomes pretty easily (see: www.biostars.org/p/63816/ for other examples):
$ parallel "wget -qO- ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/release/20130502/ALL.{}.phase3_integrated.20130502.genotypes.vcf.gz | gunzip --stdout - | convert2bed -i vcf - | awk '{print \"chr\"$0;}' | bedmap --echo --echo-map-uniq-id --chrom {} genes.bed - > answer.{}.bed"::: {1..22} X Y
Then collate the per-chromosome results into one file:
$ bedops --everything answer.*.bed > answer.bed
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Could you tell us what kind of data do you have on those genes?
Hi, I just have a list of genes like this:
and I want a list of SNP for these genes.