Batch Download websites from php database
7.2 years ago
rororo ▴ 10

I am working on nematodes and want to download ESTs from Mark Blaxter's website: http://www.nematodes.org/NeglectedGenomes/NEMATODA/Steinernema/index.html

The program that curates the database seems outdated, so I guess I would otherwise have to download them manually. Is there a way to automate this?

e.g. download all pages in this format: http://www.nematodes.org/NeglectedGenomes/NEMATODA/Steinernema/wwwPartiGene_cluster.php?cluster=SCC*&chosenorganism=SCC

php unix

NCBI has them as well. Should be less painful to get.
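If you go the NCBI route, something along these lines should work with Biopython's Entrez module. This is only a sketch: the database name ("nucest") and the search term are guesses you would need to check against what actually covers these ESTs.

```python
# Sketch: pull the ESTs from NCBI via E-utilities (Biopython's Entrez module).
# The database name ("nucest") and the search term are assumptions -- adjust them.
from Bio import Entrez

Entrez.email = "you@example.org"  # NCBI asks for a contact address

# Run the search and keep the result set on the Entrez history server.
search = Entrez.read(
    Entrez.esearch(db="nucest", term="Steinernema[Organism]", usehistory="y")
)
count = int(search["Count"])

# Download the sequences in batches as FASTA.
with open("steinernema_ests.fasta", "w") as out:
    batch = 500
    for start in range(0, count, batch):
        handle = Entrez.efetch(
            db="nucest",
            rettype="fasta",
            retmode="text",
            retstart=start,
            retmax=batch,
            webenv=search["WebEnv"],
            query_key=search["QueryKey"],
        )
        out.write(handle.read())
        handle.close()
```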


Yup, if there is another way to get this data, I would also recommend going that way.

7.2 years ago
LLTommy ★ 1.2k

This will be painful... did you try writing the author an email? If he could simply send you the data, you would save yourself some time.

If not, it seems like you would have to do HTML scraping; there is plenty of information on how to do that elsewhere, but yes, it is painful.

Do you know which "clusters" and "organisms" exist on this page? If you do, you could try to construct the URLs yourself: change the query parameters after the ? accordingly and fetch all the pages you need that way. However, it is all painful.
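For example, a small loop like this could construct the URLs and save each page. It is only a sketch: the cluster ID format (SCC plus a zero-padded number) and the range are guesses based on your example URL, so check them against the real IDs first.

```python
# Sketch: fetch a range of cluster pages by constructing the URLs directly.
# The cluster ID format and the numeric range below are assumptions.
import time
import urllib.parse
import urllib.request

BASE = "http://www.nematodes.org/NeglectedGenomes/NEMATODA/Steinernema/wwwPartiGene_cluster.php"

def fetch_cluster(cluster_id, organism="SCC"):
    """Download one cluster page and return its HTML, or None on failure."""
    query = urllib.parse.urlencode({"cluster": cluster_id, "chosenorganism": organism})
    url = f"{BASE}?{query}"
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None

if __name__ == "__main__":
    for n in range(1, 101):                       # assumed range of cluster numbers
        cluster_id = f"SCC{n:05d}"                # assumed ID format, e.g. SCC00001
        html = fetch_cluster(cluster_id)
        if html and "Error connecting to the database" not in html:
            with open(f"{cluster_id}.html", "w", encoding="utf-8") as fh:
                fh.write(html)
        time.sleep(1)                             # be polite to the server
```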

P.S.: the example URL you posted doesn't seem to work.

P.P.S.: PHP is NOT a database!


And if you do start on this, I think the 'cluster overview' might be a better starting point than the index. At least all clusters seem to be listed there (?), so you could follow all the links in that column. Would that give you all the information you need?
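If that overview page loads for you, a small scraper along these lines could collect the cluster links from it. Again just a sketch: OVERVIEW_URL below is only a stand-in (I don't have the exact link), and the assumption is that cluster pages are linked via wwwPartiGene_cluster.php hrefs.

```python
# Sketch: collect links to cluster pages from an overview page.
# OVERVIEW_URL is a placeholder; replace it with the real cluster overview URL.
from html.parser import HTMLParser
import urllib.parse
import urllib.request

OVERVIEW_URL = "http://www.nematodes.org/NeglectedGenomes/NEMATODA/Steinernema/index.html"

class ClusterLinkCollector(HTMLParser):
    """Collect hrefs of anchor tags that point at cluster pages."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and "wwwPartiGene_cluster.php" in value:
                    self.links.append(urllib.parse.urljoin(OVERVIEW_URL, value))

with urllib.request.urlopen(OVERVIEW_URL, timeout=30) as resp:
    page = resp.read().decode("utf-8", errors="replace")

parser = ClusterLinkCollector()
parser.feed(page)
for link in parser.links:
    print(link)   # these URLs could then be fetched one by one, as in the sketch above
```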


Yes, I know that site, and just fetching by cluster number was my first guess, but when I save the page, all I get is a small file containing "Error connecting to the database !". The same happens when I click the link you posted. Thanks for your help!


Ah, I see. What I did was to search with the box empty; that gives you a page listing all the clusters. It is obviously impossible to link to that page directly, but that is where I would start scraping. BUT check the post above from genomax - maybe you can get the data from somewhere else? Scraping this from the page sounds like a lot of effort that is not really connected to your work, so finding another way would be beneficial.
