Instead of pissing off UCSC by throwing hundreds of thousands of queries their way, why not just download the annotation tables you need (via the Table Browser or the FTP site) and process those locally? That would be vastly simpler than scraping a bunch of web pages.
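For example, here is a minimal sketch of that approach in Python, assuming you want the hg38 RefSeq gene table; the file name and URL on the UCSC download server (hgdownload.soe.ucsc.edu) are the usual ones for refGene, but check the downloads page for the table and assembly you actually need:

```python
import gzip
import urllib.request

# Assumed location of the hg38 refGene dump on the UCSC download server.
URL = "https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"

# Fetch the table once, then process it locally as many times as you like.
urllib.request.urlretrieve(URL, "refGene.txt.gz")

# Quick sanity check: count transcripts per chromosome.
counts = {}
with gzip.open("refGene.txt.gz", "rt") as fh:
    for line in fh:
        fields = line.rstrip("\n").split("\t")
        chrom = fields[2]  # third column of refGene is the chromosome
        counts[chrom] = counts.get(chrom, 0) + 1

for chrom, n in sorted(counts.items()):
    print(chrom, n)
```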
I agree. I use MySQL for the same job; it's very easy to use, and you can also download RefSeq to your own computer. It requires almost no computing power and works fine on a laptop too.
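As a rough sketch, this is what querying UCSC's public MySQL server (host genome-mysql.soe.ucsc.edu, user "genome", no password) looks like from Python; I'm using the third-party PyMySQL driver here purely as an example, any MySQL client or the plain `mysql` command line works just as well:

```python
import pymysql  # pip install pymysql

# Connect to the UCSC public MySQL server and query the hg38 refGene table.
conn = pymysql.connect(
    host="genome-mysql.soe.ucsc.edu",
    user="genome",
    database="hg38",
)
try:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT name, chrom, txStart, txEnd "
            "FROM refGene WHERE chrom = 'chr1' LIMIT 5"
        )
        for name, chrom, tx_start, tx_end in cur.fetchall():
            print(name, chrom, tx_start, tx_end)
finally:
    conn.close()
```

One batched query like this replaces hundreds of scraped pages, and the result comes back as clean tab-like rows instead of HTML.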
As a former Ensembl team member, I just want to emphasise that scraping these websites is absolutely not done! I know of people who were scraping the Ensembl genome browser website and were given IP bans for it, which were only lifted after Ensembl spoke with them and explained how to get the desired data without slowing down or bringing down the production web servers. So please be aware of this. As already indicated in the other responses, there are plenty of better ways to get the UCSC data (the public MySQL server, the downloads site, the Table Browser).