How to get protein names, gene symbols, refseq numbers, gi numbers... using UCSC mysql?
1
0
Entering edit mode
9.5 years ago
grayapply2009 ▴ 300

I want to fetch some information (protein names, gene symbols, refseq numbers, gi numbers...) from human refseq protein database using mysql. I found this code:

mysql --user=genome -N --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e "select name,name2 from refGene" > Refseq2Gene.txt

This code only gives me locus information and gene symbols.

How do I modify this code to get other information I need?

mysql ucsc • 4.8k views
ADD COMMENT
1
Entering edit mode

You might want to try Ensembl, they offer MySQL access as well as a Perl API and Biomart.

ADD REPLY
0
Entering edit mode

How do I do that? Sorry, I'm new to these things.

ADD REPLY
1
Entering edit mode

They have a very detailed help section as well as video tutorials on youtube.

ADD REPLY
3
Entering edit mode
9.5 years ago
biocyberman ▴ 870

I would use biomart for this purpose. Here is the biomart query URL: http://www.ensembl.org/biomart/martview/95b00a8c74fb4a574b028b0f90f51d76

And its Perl rendition: http://www.ensembl.org/biomart/martview/95b00a8c74fb4a574b028b0f90f51d76

# An example script demonstrating the use of BioMart API.
# This perl API representation is only available for configuration versions >=  0.5 
use strict;
use BioMart::Initializer;
use BioMart::Query;
use BioMart::QueryRunner;

my $confFile = "PATH TO YOUR REGISTRY FILE UNDER biomart-perl/conf/. For Biomart Central Registry navigate to
                        http://www.biomart.org/biomart/martservice?type=registry";
#
# NB: change action to 'clean' if you wish to start a fresh configuration  
# and to 'cached' if you want to skip configuration step on subsequent runs from the same registry
#

my $action='cached';
my $initializer = BioMart::Initializer->new('registryFile'=>$confFile, 'action'=>$action);
my $registry = $initializer->getRegistry;

my $query = BioMart::Query->new('registry'=>$registry,'virtualSchemaName'=>'default');


    $query->setDataset("hsapiens_gene_ensembl");
    $query->addAttribute("ensembl_gene_id");
    $query->addAttribute("external_gene_name");
    $query->addAttribute("refseq_peptide");
    $query->addAttribute("entrezgene");
    $query->addAttribute("refseq_mrna");

$query->formatter("TSV");

my $query_runner = BioMart::QueryRunner->new();
############################## GET COUNT ############################
# $query->count(1);
# $query_runner->execute($query);
# print $query_runner->getCount();
#####################################################################


############################## GET RESULTS ##########################
# to obtain unique rows only
# $query_runner->uniqueRowsOnly(1);

$query_runner->execute($query);
$query_runner->printHeader();
$query_runner->printResults();
$query_runner->printFooter();
#####################################################################`

But if you want to stick with UCSC's mysql interface, it is helpful to explore its table structure and make a union/join query across multiple tables to get what you want. For example: refLink table and gbCdnaInfo have pointers to some of fields you want:

Database: hg19    Primary Table: refLink    Row Count: 435,137   Data last updated: 2015-05-30
Format description: Link together a refseq mRNA and other stuff
field example SQL type info description
name varchar(255) values Name displayed in UI
product varchar(255) values Name of protein product
mrnaAcc NR_025344 varchar(255) values mRNA accession
protAcc varchar(255) values protein accession
geneName 0 int(10) unsigned range pointer to geneName table
prodName 0 int(10) unsigned range pointer to prodName table
locusLinkId 0 int(10) unsigned range Entrez ID (formerly LocusLink ID)
omimId 0 int(10) unsigned range OMIM ID
ADD COMMENT
0
Entering edit mode

Thanks, my friend! It'll take me some time to figure it out.

ADD REPLY

Login before adding your answer.

Traffic: 1965 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6