hi
If I had the genomic coords, I can get the intersection of SNPs and my gene list. But right now I just have my gene list.
What would be your method to extract SNPs that fall within a desired gene list?
Cheers
hi
If I had the genomic coords, I can get the intersection of SNPs and my gene list. But right now I just have my gene list.
What would be your method to extract SNPs that fall within a desired gene list?
Cheers
EnsEMBL Perl API (this is largely from the courses they give!):
use strict;
use warnings;
use Bio::EnsEMBL::Registry;
my $reg = 'Bio::EnsEMBL::Registry';
$reg->load_registry_from_db(-host => 'ensembldb.ensembl.org', -user => 'anonymous');
my $gene_name = shift;
my $ga = $reg->get_adaptor('Human', 'Core', 'Gene');
my $sa = $reg->get_adaptor('Human', 'Core', 'Slice');
my $vfa = $reg->get_adaptor('Human', 'Variation', 'VariationFeature');
my $genes = $ga->fetch_all_by_external_name($gene_name);
while (my $gene = shift @{$genes}) {
my $chr = $gene->seq_region_name;
my $start = $gene->seq_region_start;
my $end = $gene->seq_region_end;
my $length = $end - $start + 1;
my $region = sprintf "%s:%d-%d", $chr, $gene->start, $gene->end;
print join("\t", ($gene->stable_id, $region, $length, $gene->external_name, $gene->description) ), "\n";
my $slice = $sa->fetch_by_region('chromosome', $chr, $start, $end);
my @vfs = @{$vfa->fetch_all_by_Slice($slice)};
for my $vf (@vfs) {
print
$vf->variation_name, ' has alleles ', $vf->allele_string,
' located at ', $slice->seq_region_name, ':',
$vf->seq_region_start, '-', $vf->seq_region_end, "\n";
}
}
% perl get_snps_gene_name.pl IL4
ENSG00000113520 5:132009678-132018368 8691 IL4 interleukin 4 [Source:HGNC Symbol;Acc:6014]
rs2070874 has alleles C/T located at 5:132009710-132009710
rs2243251 has alleles A/G located at 5:132009787-132009787
rs4986964 has alleles T/C located at 5:132009821-132009821
rs56279116 has alleles G/A located at 5:132010172-132010172
rs55743996 has alleles C/G located at 5:132010191-132010191
rs56141757 has alleles G/A located at 5:132010216-132010216
rs71645907 has alleles C/G located at 5:132010423-132010423
rs71645908 has alleles C/T located at 5:132010490-132010490
rs113312579 has alleles A/T located at 5:132010563-132010563
rs71645909 has alleles A/G located at 5:132010573-132010573
rs2243252 has alleles T/C located at 5:132010588-132010588
rs2079102 has alleles T/G located at 5:132010606-132010606
rs734244 has alleles C/T located at 5:132010726-132010726
rs2243253 has alleles C/T located at 5:132010761-132010761
rs4996002 has alleles T/G located at 5:132010764-132010764
rs71645910 has alleles C/T located at 5:132010822-132010822
rs71645911 has alleles G/A located at 5:132010826-132010826
rs11479198 has alleles C/- located at 5:132011166-132011166
...
It can also give you consequences of the SNP within the gene (intronic, non-synonymous, etc), phenotypes, 1000 genome population info, etc.
e.g., 1000K population allele frequencies:
my $aa = $reg->get_adaptor($species, 'variation', 'allele');
my $alleles = $aa->fetch_all_by_Variation($var);
for my $a (@$alleles) {
if (defined $a->population and $a->allele eq $$vars{$name}{alt_base}) {
if ($a->population->name eq '1000GENOMES:pilot_1_CEU_low_coverage_panel') {
$allele_freq_string .= 'CEU:' . sprintf("%.2f", 100*$a->frequency) . ';';
} elsif ($a->population->name eq '1000GENOMES:pilot_1_CHB+JPT_low_coverage_panel') {
$allele_freq_string .= 'CHB+JPT:' . sprintf("%.2f", 100*$a->frequency) . ';';
} elsif ($a->population->name eq '1000GENOMES:pilot_1_YRI_low_coverage_panel') {
$allele_freq_string .= 'YRI:' . sprintf("%.2f", 100*$a->frequency) . ';';
}
}
}
print $allele_freq_string . "\n";
Trait/disease association:
my $vaa = $reg->get_adaptor($species, 'variation', 'variationannotation');
foreach my $var_anno (@{$vaa->fetch_all_by_Variation($var)}) {
next if not defined $var_anno->phenotype_description;
$var_pheno_string .= $var_anno->phenotype_description;
if (defined $var_anno->p_value) {
$var_pheno_string .= ' (' . $var_anno->p_value . ')';
}
if (defined $var_anno->associated_variant_risk_allele) {
$var_pheno_string .= ' [risk allele: ' . $var_anno->associated_variant_risk_allele . ']';
}
$var_pheno_string .= ';';
}
print $var_pheno_string . "\n";
Chris
Use Biomart.
For more complex searchers, you can do this in two steps. For example, you can first retrieve the coordinates of the genomic regions of interest using the code Ensembl Genes dataset, and then use these results to query Ensembl Variation.
To automatize these steps, you can also have a look at Galaxy.
You can use the public UCSC mysql server. E.g:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e '
select
X.geneSymbol,
S.name,
S.chrom,
S.chromStart
from
snp132 as S,
knownGene as K,
kgXref as X
where
S.chrom=K.chrom and
X.kgId=K.name and
not(txEnd<=S.chromStart or S.chromEnd<=K.txStart) and
X.geneSymbol in ("EIF2A","EIF4G1")'
geneSymbol name chrom chromStart
EIF2A rs73869253 chr3 150264619
EIF2A rs114580236 chr3 150264679
EIF2A rs16862731 chr3 150264706
EIF2A rs17280260 chr3 150264778
EIF2A rs115840267 chr3 150264789
(...)
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.