Is There Such A Thing As A UCSC API?
11
16
14.1 years ago

Firstly, is there any real difference between the data stored at UCSC and that at EnsEMBL? If so, I am wondering how to programmatically retrieve genomic information from UCSC in a similar manner to the EnsEMBL Perl API?

I notice that Jan Aerts has started developing a ruby-ucsc-api, but I'm not sure how complete it is: https://github.com/jandot/ruby-ucsc-api

I'm developing in Python so would like to use that primarily if possible. I need to retrieve genes, transcripts, introns, exons, repeats etc.

ucsc python comparative api • 16k views
2

Given what you want to get, I'd suggest the ruby-ensembl-api :-)

2

I've added an example of how to print exons and introns using the ensembl API in examples_perl_tutorial.rb

2

To make the library efficient, one should change overlap_sql() in https://github.com/jandot/ruby-ucsc-api/blob/master/lib/ucsc/hg18/activerecord.rb

1

No, that API will be very inefficient. All the magic of UCSC is the "bin" field. If you do not use that, you will lose most of the power of UCSC!

0

Which tracks do you need access to?

0

Here is a description of the bin field, with links to implementations in C, Perl, Python and Ruby: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system

18
14.1 years ago
Jandot ▴ 370

Hi there,

You're correct that I have started a UCSC API in Ruby at http://github.com/jandot/ruby-ucsc-api. Making an API for UCSC is, however, not straightforward. This has nothing to do with complexity, but with the number of tables: if I remember correctly, there are >1,200 tables in the UCSC database. This is because (in contrast to Ensembl) the UCSC database is organized specifically so that it works fast in the genome browser (Ensembl uses a more normalized schema). As you might understand, I wasn't going to spend days or weeks going through all those tables and creating an API for every one of them when I didn't need them myself. Instead, I created the general framework to get it working, and only created an API for those tables that I needed in my work at that moment (the ones related to CNVs). In other words: tables are only added to the API on an as-needed basis.

If you have a look in ruby-ucsc-api/lib/ucsc/hg18/activerecord.rb, you can see that the API code for a particular table looks like this:

class CnpRedon < DBConnection
  include Ucsc::Hg18::Feature

  set_table_name 'cnpRedon'
  set_primary_key nil

  def self.find_by_slice(slice)
    start = slice.range.begin
    stop = slice.range.end
    return CnpRedon.find_by_sql('SELECT * FROM cnpRedon' + overlap_sql(slice, start, stop))
  end
end

Everything is set up so that it is straightforward to add new tables. So (given that you would like to use the Ruby language), I suggest you clone/fork the API, and then copy/paste/modify the above code snippet to add the tables you need. This takes seconds rather than minutes for each table :-)

2

-1 for not using the "bin" field. This will lead to very inefficient retrieval.

2

Hi Heng. I know about the bin field; you might remember we discussed this together with James Bonfield in the DiNA long ago. The ruby-ucsc-api was created as an ad-hoc solution for something Klaudia Walter needed, and the emphasis was on getting it working as soon as possible. If I have time I will definitely rewrite the overlap_sql function to take bins into account. (Or, as you have already done this in Perl, nothing should stop you from cloning the git repo and changing the function yourself :-) )

1

I will make it simpler so that you only have to add the following to the API:

class CnpRedon < DBConnection
  include Ucsc::Hg18::Feature

  set_table_name 'cnpRedon'
  set_primary_key nil
end

In other words: I'll move the self.find_by_slice somewhere else.

0

Thanks Jan! I'll take a good look at the IP and hopefully make some additions for my requirements, after I've read through the Schattner book and the UCSC MySQL information :)

0

I meant API, not IP, lol!

14
14.1 years ago
lh3 33k

I saw potential misuses of UCSC MySQL, so I decided to add a comment. It is partially an answer as well.

All the magic of UCSC MySQL is the "bin" field. This substantially improves the retrieval speed in large tables (e.g. est and snp). You can choose not to use the bin field, but this is not optimal. One should read the UCSC paper for the details on how to use the "bin" field. If you want to query by yourself, please understand "bin" first, for you and for other users connecting to the same database.
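For Python users, here is a minimal sketch of the standard binning scheme, written from the description on the UCSC genomewiki page linked elsewhere in this thread. The function names (`bin_from_range`, `overlapping_bins`, `overlap_where`) are my own and not part of any UCSC library, the `%s` placeholders assume a MySQL driver such as PyMySQL, and the default column names `txStart`/`txEnd` only fit gene tables like refGene (BED-like tables use `chromStart`/`chromEnd`). It covers only the standard scheme for coordinates up to 512 Mb, not the extended one.

```python
# Standard UCSC binning scheme (coordinates up to 512 Mb).
BIN_OFFSETS = (585, 73, 9, 1, 0)  # finest (128 kb) to coarsest (512 Mb) level
FIRST_SHIFT = 17                  # 2**17 = 128 kb, size of the smallest bin
NEXT_SHIFT = 3                    # each level up is 8 times larger


def bin_from_range(start, end):
    """Smallest bin that fully contains the 0-based half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("range %d-%d does not fit the standard scheme" % (start, end))


def overlapping_bins(start, end):
    """One (first_bin, last_bin) pair per level that may hold overlapping features."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    ranges = []
    for offset in BIN_OFFSETS:
        ranges.append((offset + start_bin, offset + end_bin))
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    return ranges


def overlap_where(chrom, start, end, start_col="txStart", end_col="txEnd"):
    """Return (where_clause, params) for a bin-aware overlap query."""
    bin_sql = " OR ".join(
        "bin = %d" % lo if lo == hi else "bin BETWEEN %d AND %d" % (lo, hi)
        for lo, hi in overlapping_bins(start, end)
    )
    clause = "chrom = %s AND ({bins}) AND {s} < %s AND {e} > %s".format(
        bins=bin_sql, s=start_col, e=end_col
    )
    return clause, (chrom, end, start)
```

The clause would be appended to something like `SELECT * FROM refGene WHERE ...` and the parameters passed to `cursor.execute` separately, so the indexed (chrom, bin) columns narrow the search before the coordinate comparison is applied.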

I once wrote a command-line tool (source code is here) for generic data retrieval. The selling point is that you do not need to modify the source code to add new tables. In most cases, you only need to provide the table name. If you know the table schema and a little SQL, you can do more powerful things. Although I seldom use it, this script is one of the smartest scripts I have written.

The script is fairly short and simple, and it should not be hard to turn it into a Perl library (that is why this is a partial answer).

EDIT: a few use cases:

  • retrieve known genes:

    echo "chr1 1 1000000" | ./batchUCSC.pl -d hg18 -p 'knownGene:::'

  • count # genes but excluding UTRs:

    echo "chr1 1 1000000" | ./batchUCSC.pl -d hg19 -p 'refGene:cdsStart:cdsEnd:COUNT(*)'

  • count # exons:

    echo "chr1 1 1000000" | ./batchUCSC.pl -d hg19 -p 'refGene:::SUM(exonCount)'
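Since the question asks for exons and introns in Python: rows of gene tables such as refGene store exon coordinates as comma-separated lists in the `exonStarts` and `exonEnds` columns (0-based, half-open, with a trailing comma). The sketch below shows one way to turn such a row into exon and intron intervals. The helper names are hypothetical, and the handling of `bytes` assumes a MySQL driver that returns blob columns undecoded.

```python
def parse_blocks(exon_starts, exon_ends):
    """Turn exonStarts/exonEnds strings (or bytes) into a list of (start, end) pairs."""
    if isinstance(exon_starts, bytes):
        exon_starts = exon_starts.decode("ascii")
    if isinstance(exon_ends, bytes):
        exon_ends = exon_ends.decode("ascii")
    # The lists end with a trailing comma, so strip it before splitting.
    starts = [int(x) for x in exon_starts.strip(",").split(",")]
    ends = [int(x) for x in exon_ends.strip(",").split(",")]
    if len(starts) != len(ends):
        raise ValueError("exonStarts and exonEnds have different lengths")
    return list(zip(starts, ends))


def introns_from_exons(exons):
    """Introns are the gaps between consecutive exons (same half-open coordinates)."""
    exons = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    ]


# Made-up coordinates, for illustration only.
exons = parse_blocks("100,300,500,", "200,400,600,")
introns = introns_from_exons(exons)
```

Because the coordinates are half-open, the end of one exon is directly usable as the start of the following intron, with no off-by-one adjustment.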

0

Here is a description of the "bin" field at UCSC, with links to Heng Li's implementations, Jandot's implementation added after this discussion here to bio-ruby, and the cruz-db python version added a few years later: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system

14
13.9 years ago

Update 2019: UCSC does have an API now: http://api.genome.ucsc.edu

What I wrote below nine years ago when I was a postdoc, not at UCSC and on a different continent, may still be relevant, as the API doesn't cover everything yet.

Also UCSC has a page about the bin field now: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system
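A minimal Python sketch of calling that API, assuming the `/getData/track` endpoint with `genome`, `track`, `chrom`, `start` and `end` parameters as described in the API's own help pages. The function names are mine, the sketch uses the https form of the address, and the exact shape of the returned JSON should be checked against the live documentation before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.genome.ucsc.edu"


def track_url(genome, track, chrom, start, end):
    """Build a getData/track URL for one region (0-based start, as in UCSC tables)."""
    query = urlencode({
        "genome": genome,
        "track": track,
        "chrom": chrom,
        "start": start,
        "end": end,
    })
    return "{}/getData/track?{}".format(API_ROOT, query)


def fetch_track(genome, track, chrom, start, end):
    """Download and decode the JSON response (needs network access)."""
    with urlopen(track_url(genome, track, chrom, start, end), timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Example region; prints the top-level keys of whatever the server returns.
    data = fetch_track("hg38", "knownGene", "chr1", 0, 1000000)
    print(sorted(data.keys()))
```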


External (Ruby or Python) APIs

You can try to play around with a hacked-together Ruby or Python API that accesses the tables via MySQL. You could write your own in Java. But they are not supported officially and just hacked together by a single programmer. In my opinion, they may prove to waste your time in the end.

Actually, you do not need an API: the UCSC table browser provides ample help to construct SQL commands, describing each and every field and all relationships between the tables. You can then access the tables yourself with your MySQL client (though using the binning scheme will make the queries a lot faster, as Heng wisely pointed out). I see this as an advantage, as you have one dependency less: no external library that can break, and no messy updates.
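To illustrate the kind of plain SQL involved, here is a sketch of an overlap query against a refGene-style table. So that it runs offline, it uses Python's built-in `sqlite3` with a tiny in-memory table as a stand-in for the real server. The rows are made up and the table holds only a subset of the real refGene columns. Against the public server (host `genome-mysql.soe.ucsc.edu`, user `genome`, no password) you would use a MySQL driver instead, with `%s` rather than `?` as placeholder, and ideally add a condition on the `bin` column.

```python
import sqlite3

# Half-open overlap test: a feature overlaps [start, end) if it starts before
# `end` and ends after `start`.
OVERLAP_SQL = (
    "SELECT name, chrom, strand, txStart, txEnd, exonCount "
    "FROM refGene WHERE chrom = ? AND txStart < ? AND txEnd > ? "
    "ORDER BY txStart"
)


def genes_in_region(conn, chrom, start, end):
    """Return all rows whose transcript overlaps the region."""
    return conn.execute(OVERLAP_SQL, (chrom, end, start)).fetchall()


def demo_connection():
    """In-memory stand-in for the UCSC database, filled with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refGene (bin INTEGER, name TEXT, chrom TEXT, strand TEXT, "
        "txStart INTEGER, txEnd INTEGER, exonCount INTEGER)"
    )
    conn.executemany(
        "INSERT INTO refGene VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            (585, "toyA", "chr1", "+", 100, 900, 3),
            (585, "toyB", "chr1", "-", 5000, 7000, 2),
            (585, "toyC", "chr2", "+", 100, 900, 1),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in genes_in_region(demo_connection(), "chr1", 0, 1000):
        print(row)
```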

Avoid DAS

Direct MySQL access is more general than using the overly complex DAS XML format, which will only give you the chrom-start-stop-like annotations, not X-Y plots and not any of the special formats that UCSC is using (chain, net, psl, wiggle, maf, etc). There is no point in using the UCSC database if you cannot access the advanced data fields.

A book?

Peter Schattner's book is interesting in that he writes the same software with several different APIs, so you can compare the implementations. But I am unsure whether I would buy it just for the two chapters (chapters 9 and 10, pages 148-214). They offer, however, a very good introduction to the topic, so you might buy it nevertheless.

The UCSC API

The true power of the UCSC API and the key to its speed is only accessible from C, because the genome browser is written in C and so is the API. If you know some C you should be able to figure out how it works quite quickly. Download the source of the UCSC tools and compile the libraries and tools, following the instructions on http://genome.ucsc.edu/admin/jk-install.html.

Then, this is the most important part, search for something that is similar to what you are planning to do. Want to parse a 2bit file - look at twoBitToFasta.c. Want to get information on how to load tables into C structures - look at e.g. featureBits.c. Want to know how to map between genomes - look at liftOver.c. And so on. Copy-and-pasting will get you very far, given that there are >150k lines of code to look at. And, to take into account Heng's comment, it will take care of the bin-field automatically.

Don't forget that everything that the UCSC guys do is very well documented in their makeDB files on http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/makeDb/doc/ and that a lot of stuff is documented at other places (use my page on http://genomewiki.ucsc.edu/index.php/Learn_about_the_Browser as a reference sheet). The makeDb files should show you which tool you need to look at. If you don't know what tool is most similar to your task, then send an email to their mailing list to ask for the name of a tool that does xyz - there are >750 tools in the source tree, so there often is something already somewhere in their code.

When you're doing this for the first time, it will take more time to set up than if you use a famous Perl bioinformatics API, but it will produce stable and very, very fast code. You can also tackle any genomics problem with the API and will be able to use the code for years. The example in Schattner's book is many times faster with the UCSC API than with the Perl API (219 seconds with the Ensembl API versus 6 seconds with the UCSC API; page 176, second paragraph). In addition, your C code will never break due to a version change somewhere on the internet, as it runs on local text files.

0

"There is no point in using the UCSC database if you cannot access the advanced data fields." huh? what should he use for interval data?

1

Use the tools: bigWigToWig or bigWigToBedGraph with the -chrom, -start and -end options. They accept a URL, like this:

bigWigToBedGraph http://hgdownload.cse.ucsc.edu/gbdb/hg19/bbi/wgEncodeBroadHistoneK562Cbx2Sig.bigWig -chrom=chr21 -start=0 -end=1000000 stdout

Same works for bigBedToBed.

For bed files, use the tool overlapSelect. I just realized that we need to document this somewhere...

7
14.1 years ago

As GWW said, there is a public MySQL server for UCSC (http://genome.ucsc.edu/FAQ/FAQdownloads.html#download29), but there is also a DAS server: http://genome.ucsc.edu/FAQ/FAQdownloads.html#download23

On my side, I tried to generate some Java classes to query Ensembl/UCSC using the XML definitions of the tables. See this post

6
14.1 years ago
Gww ★ 2.7k

You can directly access their MySQL database using the information here. You could also run your own local copy of their database (or just a selection of tables that you are interested in).

In general, there is a lot of overlap between EnsEMBL and UCSC, but they do have different gene prediction algorithms and different data tracks.

5

0

More info on cruzdb at http://arxiv.org/abs/1303.3332

4
14.1 years ago

Chapters 9 & 10 of Peter Schattner's book Genomes, Browsers and Databases describe the UCSC C API (aka the "Kent source tree") in some detail.

1

Hi Casey, many thanks! I've been meaning to get a hold of a copy of that book! Perhaps a good reason to make the purchase :) I might check if they have one in the library here in the meantime?!

3
14.1 years ago

The pygr project is an interesting approach to the problem of data access from multiple sources. In Python, though, it is pretty simple to use a light wrapper around SQLAlchemy to get at the tabular data stored in their public MySQL database in a pythonic way. You can get at the sequence data using a tool like pyfasta. And bx-python is great for very fast interval manipulations and matching.

2
13.8 years ago
Botond Sipos ★ 1.7k

The Genoman Perl module supports the retrieval of annotation both from EnsEMBL and UCSC.

2
13.6 years ago

I am developing BioRuby-UCSC-API, a BioRuby plugin based on Jan Aerts' ruby-ucsc-api.

The features of BioRuby-UCSC-API are as follows:

  • Using ActiveRecord as an O/R mapper (similar to ruby-ensembl-api and ruby-ucsc-api)
  • Using the Bin index system to improve query performance
  • Automatic conversion of "1-based full-closed intervals" to internal "0-based half-closed intervals"
  • Version 0.0.4 supports almost all the tables in the hg19 database. But table relations are not completed.
  • Supporting reference sequence retrieval from locally-stored 2bit files. Official MySQL server does not support this function.
  • Supporting local (mirror) MySQL servers
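The interval conversion in the list above is a common source of off-by-one errors, so here is a small Python sketch of it. The function names are hypothetical and not part of BioRuby-UCSC-API. It assumes the usual convention that UCSC tables store 0-based intervals with an exclusive end (what the list calls "half-closed"), while the browser and most users write 1-based intervals with both ends included.

```python
def to_zero_based_half_open(start, end):
    """Convert a 1-based, fully closed interval to 0-based, end-exclusive."""
    if start < 1 or end < start:
        raise ValueError("invalid 1-based interval: %d-%d" % (start, end))
    # Only the start shifts; the inclusive 1-based end equals the exclusive 0-based end.
    return start - 1, end


def to_one_based_closed(start, end):
    """Convert a 0-based, end-exclusive interval back to 1-based, fully closed."""
    if start < 0 or end <= start:
        raise ValueError("invalid 0-based interval: %d-%d" % (start, end))
    return start + 1, end


# chr1:1-1000 as typed into the browser covers the same bases as
# chromStart=0, chromEnd=1000 in a table.
table_coords = to_zero_based_half_open(1, 1000)
```

In both conventions the interval length is easy to check: `end - start` for table coordinates and `end - start + 1` for browser coordinates.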

This package is still experimental. Your comments, suggestions and requests are welcome.

BioRuby-UCSC-API is available at

1
8.6 years ago

I understand this is a very old post, but in case it helps anyone searching for programmatic access directly using UCSC tools, I found the link below:
http://genomewiki.ucsc.edu/index.php/Programmatic_access_to_the_Genome_Browser

It has details on

1) How to download data from their MySQL database

2) How to get the chromosome sequence for a range (using the REST API, which was what I was looking for)

... and a few other things, including accessing a copy of the current Genome Browser image

Hope this helps!
