How to scrape data from UCSC genome browser?
10.4 years ago
ajstern ▴ 10

I want to compare the quality of different human genome assemblies by looking at their inclusion of the RefSeq genes.

On the UCSC browser I can look up the location of a RefSeq gene by its accession number in any assembly, for example: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg16&position=chr22%3A17007506-17034714&hgsid=381962759_r6crkXh3VMlFtCa2rnXaO5BTAjcH

However, as a newbie at programming in general, I'm unsure how to scrape, for each assembly, which RefSeq genes are included and which are missing. Anyone have tips?

Assembly genome browser ucsc refseq • 3.5k views
10.4 years ago

Instead of pissing off UCSC by throwing hundreds of thousands of queries their way, why not just download the various annotation tables (via the table browser or ftp site) and simply process those? That would be vastly simpler than scraping a bunch of web pages.
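For what it's worth, here is a minimal sketch of the "download and process" route in Python. It assumes you have already fetched one refGene.txt.gz dump per assembly from the UCSC downloads area (the file names in the usage note are hypothetical), and that the accession sits in the first column, or in the second when the table starts with an integer bin column. Check the .sql file that ships next to each dump to confirm the layout for your assembly.

```python
import gzip


def read_accessions(path):
    """Return the set of RefSeq accessions listed in one gzipped refGene dump."""
    accessions = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if not fields[0]:
                continue  # skip blank lines
            # Assumption: a leading integer is the 'bin' index column,
            # so the accession is the next field; otherwise it is the first.
            if fields[0].isdigit() and len(fields) > 1:
                accessions.add(fields[1])
            else:
                accessions.add(fields[0])
    return accessions


def compare_assemblies(paths_by_assembly):
    """Map each assembly to the accessions it lacks, relative to the union of all."""
    found = {asm: read_accessions(p) for asm, p in paths_by_assembly.items()}
    everything = set().union(*found.values())
    return {asm: everything - accs for asm, accs in found.items()}
```

Usage would look something like compare_assemblies({"hg16": "hg16.refGene.txt.gz", "hg19": "hg19.refGene.txt.gz"}), which returns, per assembly, the accessions that appear in some other assembly but not in that one. Keep in mind that the dumps were taken at different times, so a missing accession can reflect the RefSeq release as much as the assembly itself.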

10.4 years ago

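Don't scrape it. You can query their public MySQL server directly. Select the database for the assembly you want with -D (hg19 here is only an example):

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -NB -D hg19 -e 'select * from refGene'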
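If you want to repeat that query for several assemblies, a rough Python sketch could wrap the same mysql client call. This assumes the mysql command-line client is installed locally and that each assembly name (e.g. hg16, hg19) is also the name of a database on the server. The function names are made up for illustration, and the -A and -D options are additions to the command above: -D selects the database, -A skips auto-completion setup.

```python
import subprocess

HOST = "genome-mysql.cse.ucsc.edu"  # host taken from the command above


def build_command(assembly):
    """Build the mysql client call that lists refGene accessions for one assembly."""
    return [
        "mysql", "--user=genome", f"--host={HOST}",
        "-A", "-NB",     # no auto-rehash, no header row, tab-separated batch output
        "-D", assembly,  # assumption: assembly name doubles as the database name
        "-e", "select distinct name from refGene",
    ]


def parse_names(output):
    """Turn the client's one-accession-per-line output into a set."""
    return {line.strip() for line in output.splitlines() if line.strip()}


def fetch_accessions(assembly):
    """Run the query (one network request per assembly) and return the accessions."""
    result = subprocess.run(
        build_command(assembly), capture_output=True, text=True, check=True
    )
    return parse_names(result.stdout)
```

With that, fetch_accessions("hg19") - fetch_accessions("hg16") would give the accessions annotated on hg19 but not on hg16. That is one query per assembly rather than one per gene, which is much kinder to their server.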

I agree. I use MySQL for the same job; it's very easy to use, and you can also download the RefSeq tables to your own computer. It needs almost no computing power and works fine on a laptop too.

10.4 years ago
Bert Overduin ★ 3.7k

As a former Ensembl team member, I just want to emphasise that scraping websites is absolutely NOT DONE!!! I know of people who were scraping the Ensembl Genome Browser website and were given IP bans because of this (which were lifted again after Ensembl spoke with them and told them how to get the desired data without slowing down / bringing down their production webservers). So, please be aware of this! As already indicated in the other responses, there are many other ways to get the UCSC data (mysql, downloads, Table browser).
