Question

How to scrape this MPMP database?

0

16 months ago

rohitsatyam102 ▴ 940

I am trying to write R code that reads the table on a database page into a data frame. The following rvest approach has worked for me on other web pages in the past, but it does not work for this one:

library(rvest)
## didn't work
read_html("https://MPMP.huji.ac.il/maps/VATPASE.html") %>% html_table()
read_html("https://mpmp.huji.ac.il/maps/aminosugmetpath.html") %>% html_nodes("table.table.table-bordered.table-hover")

The code above returns an empty list instead of the table as a data frame. Other, similar links from this database also fail to parse:

read_html("https://mpmp.huji.ac.il/Search/PFID?Length=6?Length=6&pfid=PF3D7_0100200") %>% html_nodes("table") %>% html_table(fill = TRUE)

Since I am new to this, I don't know how else I can grab the table from this link. Kindly help!

rvest html web-scrapping • 1.5k views

ADD COMMENT • link updated 16 months ago by Ram 45k • written 16 months ago by rohitsatyam102 ▴ 940

0

Why not contact the data owners instead of scraping web pages?

ADD REPLY • link 16 months ago by Ram 45k

0

I tried that, but I didn't get any reply. Also, the database keeps being updated, so I thought I would write a scraper that can be run every six months to update the pathways.

ADD REPLY • link 16 months ago by rohitsatyam102 ▴ 940

score 3 · Accepted Answer · 2024-01-26

3

16 months ago

Ram 45k

You need to inspect the page source and navigate the HTML to get to the table:

dat <- read_html('https://mpmp.huji.ac.il/maps/VATPASE.html')
dat %>% html_element("body") %>% html_elements("div") %>% html_element("table.table-bordered") %>% head(1) %>% html_table()
[[1]]
# A tibble: 13 × 6
   PFID          `PFID Old`  Annotation                                                `Formal Annotation`                                       EC         Transcript
   <chr>         <chr>       <chr>                                                     <chr>                                                     <chr>      <lgl>
 1 PF3D7_0721900 PF07_0090a  V-type ATPase V0 subunit e, putative                      V-type ATPase V0 subunit e, putative                      ""         NA
 2 PF3D7_1354400 MAL13P1.271 V-type proton ATPase 21 kDa proteolipid subunit, putative V-type proton ATPase 21 kDa proteolipid subunit, putative "3.6.3.14" NA
 3 PF3D7_0519200 PFE0965c    V-type proton ATPase c 16 kDa proteolipid subunit         V-type proton ATPase 16 kDa proteolipid subunit           "3.6.3.14" NA
 4 PF3D7_1311900 PF13_0065   V-type proton ATPase catalytic subunit A                  V-type proton ATPase catalytic subunit A                  "3.6.3.14" NA
 5 PF3D7_0806800 PF08_0113   v-type proton atpase subunit a, putative                  v-type proton atpase subunit a, putative                  "3.6.3.14" NA
 6 PF3D7_0406100 PFD0305c    V-type proton ATPase subunit B                            V-type proton ATPase subunit B                            "3.6.3.14" NA
 7 PF3D7_1464700 PF14_0615   V-type proton ATPase subunit c                            ATP synthase (C/AC39) subunit, putative                   "3.6.3.14" NA
 8 PF3D7_0106100 PFA0300c    V-type proton ATPase subunit C, putative                  V-type proton ATPase subunit C, putative                  "3.6.3.14" NA
 9 PF3D7_1341900 PF13_0227   V-type proton ATPase subunit D, putative                  V-type proton ATPase subunit D, putative                  "3.6.3.14" NA
10 PF3D7_0934500 PFI1670c    V-type proton ATPase subunit E, putative                  V-type proton ATPase subunit E, putative                  "3.6.3.14" NA
11 PF3D7_1140100 PF11_0412   V-type proton ATPase subunit F, putative                  V-type proton ATPase subunit F, putative                  "3.6.3.14" NA
12 PF3D7_1323200 PF13_0130   V-type proton ATPase subunit G, putative                  V-type proton ATPase subunit G, putative                  "3.6.3.6"  NA
13 PF3D7_1306600 PF13_0034   V-type proton ATPase subunit H, putative                  V-type proton ATPase subunit H, putative                  "3.6.3.14" NA

I'm using head(1) because I can't find a unique class identifier in a parent element. Digging deeper might turn one up and remove the index guessing.

EDIT

This is the line you want for all (I'm making an educated guess) URLs in that website:

read_html("https://mpmp.huji.ac.il/maps/aminosugmetpath.html") %>% html_elements("table.table-bordered.table-hover") %>% html_table()
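
To illustrate why selecting by class is sturdier than picking a table by position, here is a minimal offline sketch. The HTML below is a made-up stand-in, not the real MPMP markup: it only assumes that the data table carries the classes table-bordered and table-hover and that some other table precedes it on the page.

```r
library(rvest)

# Hypothetical page: a layout table first, then the data table we want.
page <- minimal_html('
  <div><table class="layout"><tr><td>menu</td></tr></table></div>
  <div>
    <table class="table table-bordered table-hover">
      <tr><th>PFID</th><th>EC</th></tr>
      <tr><td>PF3D7_0721900</td><td></td></tr>
      <tr><td>PF3D7_1354400</td><td>3.6.3.14</td></tr>
    </table>
  </div>')

# Select by class: independent of how many other tables the page has.
by_class <- page %>%
  html_element("table.table-bordered.table-hover") %>%
  html_table()

# Select by position: works here only because we know it is the 2nd table.
all_tables <- html_elements(page, "table")
by_index <- html_table(all_tables[[2]])
```

Both calls give the same table here, but only the class selector keeps working if a page adds or drops a table before the one you want.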

ADD COMMENT • link 16 months ago by Ram 45k

0

Hi

When I try your code I get the following:

> dat %>% html_element("body") %>% html_elements("div") %>% html_element("table.table-bordered") 
{xml_nodeset (0)}

I am using rvest v1.0.3

ADD REPLY • link 16 months ago by rohitsatyam102 ▴ 940

0

I'm also using rvest_1.0.3. What is the output if you exclude the %>% html_element("table.table-bordered") part from the command you mention above? Also, are you testing with the same URL I used or a different one?

ADD REPLY • link 16 months ago by Ram 45k

0

Hi

It's the same URL. I copied your code and ran it, and the output without %>% html_element("table.table-bordered") is also {xml_nodeset (0)}. I tried it on two different systems, Linux and Windows.

ADD REPLY • link 16 months ago by rohitsatyam102 ▴ 940

0

Okay, this is strange and I don't understand the logic behind it. On my Linux system, I created a conda environment, installed rvest with mamba install r::r-rvest, and ran your code there, and it works. However, the same code fails in RStudio. Do you know what could be causing this?

ADD REPLY • link 16 months ago by rohitsatyam102 ▴ 940

1

Check your .libPaths() and compare sessionInfo() between the successful and failed runs.
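
One way to make that comparison concrete is to print the same small report in both sessions and diff the output by eye. This is only a sketch: env_report() is a hypothetical helper name, and the package list is my guess at what is relevant here (rvest plus the packages it commonly relies on for parsing and HTTP).

```r
# Report where each package is loaded from and which version it is.
# Packages that are not installed get NA for the version and "" for the path.
env_report <- function(pkgs = c("rvest", "xml2", "httr", "curl")) {
  paths <- vapply(pkgs, function(p) system.file(package = p), character(1))
  versions <- vapply(pkgs, function(p) {
    if (nzchar(system.file(package = p))) {
      as.character(packageVersion(p))
    } else {
      NA_character_
    }
  }, character(1))
  data.frame(package = pkgs, version = versions, path = paths,
             row.names = NULL, stringsAsFactors = FALSE)
}

print(R.version.string)  # R itself may differ between conda and RStudio
print(.libPaths())       # library search order for this session
print(env_report())
```

If the path column points to different libraries in the two sessions, the sessions are not actually running the same installation even when the version numbers match.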

Side note: You'll need to manually pick the index unless you can find some sort of consistent pattern. For your second URL, this is the command:

read_html("https://mpmp.huji.ac.il/maps/aminosugmetpath.html") %>% html_elements("table") %>% head(2) %>% tail(1) %>% html_table()
[[1]]
# A tibble: 10 × 6
   PFID          `PFID Old`  Annotation                                                                       `Formal Annotation`                                                                                 EC          Transcript
   <chr>         <chr>       <chr>                                                                            <chr>                                                                                               <chr>       <lgl>
 1 PF3D7_0919600 PFI0960w    dolichyl-diphosphooligosaccharide-protein glycotransferase                       added_product=dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit wbp1, putative "2.4.1.119" NA
 2 PF3D7_0629000 PFF1405c    glucosamine 6P N-acetyltransferase                                               Glucosamine-6P N-acetyltransferase                                                                  "2.3.1.4"   NA
 3 PF3D7_1025100 PF10_0245   Glutamine-fructose-6-phosphate transaminase                                      glutamine--fructose-6-phosphate aminotransferase [isomerizing], putative                            "2.6.1.16"  NA
 4 PF3D7_0624000 PFF1155w    hexokinase                                                                       hexokinase                                                                                          "2.7.1.1"   NA
 5 PF3D7_1434300 PF14_0324   O-GlcNAc transferase                                                             Hsp70/Hsp90 organizing protein                                                                      ""          NA
 6 PF3D7_1130000 PF11_0311   Phosphoacetylglucosamine mutase                                                  phosphoacetylglucosamine mutase, putative                                                           "5.4.2.3"   NA
 7 PF3D7_0211600 PFB0515w    UDP-GlcNAc:dolichyl-pyrophosphoryl-GlcNAc GlcNAc transferase                     udp-n-acetylglucosamine transferase subunit alg14, putative                                         "2.4.1.141" NA
 8 PF3D7_0806400 MAL8P1.133  UDP-GlcNAc:dolichyl-pyrophosphoryl-GlcNAc GlcNAc transferase                     glycosyltransferase family 28 protein, putative                                                     "2.4.1.141" NA
 9 PF3D7_1343600 MAL13P1.218 UDP-N-acetylglucosamine pyrophosphorylase                                        UDP-N-acetylglucosamine pyrophosphorylase, putative                                                 "2.7.7.23"  NA
10 PF3D7_0321200 PFC0935c    UDP-N-acetylglucosamine-dolichyl-phosphate N-acetylglucosaminephosphotransferase UDP-N-acetylglucosamine--dolichyl-phosphate n-acetylglucosaminephosphotransferase, putative         "2.7.8.15"  NA

EDIT

I did a bunch more digging and this will work across all your pages:

read_html("https://mpmp.huji.ac.il/maps/aminosugmetpath.html") %>% html_elements("table.table-bordered.table-hover") %>% html_table()
[[1]]
# A tibble: 10 × 6
   PFID          `PFID Old`  Annotation                                                                       `Formal Annotation`                                                                                 EC          Transcript
   <chr>         <chr>       <chr>                                                                            <chr>                                                                                               <chr>       <lgl>
 1 PF3D7_0919600 PFI0960w    dolichyl-diphosphooligosaccharide-protein glycotransferase                       added_product=dolichyl-diphosphooligosaccharide--protein glycosyltransferase subunit wbp1, putative "2.4.1.119" NA
 2 PF3D7_0629000 PFF1405c    glucosamine 6P N-acetyltransferase                                               Glucosamine-6P N-acetyltransferase                                                                  "2.3.1.4"   NA
 3 PF3D7_1025100 PF10_0245   Glutamine-fructose-6-phosphate transaminase                                      glutamine--fructose-6-phosphate aminotransferase [isomerizing], putative                            "2.6.1.16"  NA
 4 PF3D7_0624000 PFF1155w    hexokinase                                                                       hexokinase                                                                                          "2.7.1.1"   NA
 5 PF3D7_1434300 PF14_0324   O-GlcNAc transferase                                                             Hsp70/Hsp90 organizing protein                                                                      ""          NA
 6 PF3D7_1130000 PF11_0311   Phosphoacetylglucosamine mutase                                                  phosphoacetylglucosamine mutase, putative                                                           "5.4.2.3"   NA
 7 PF3D7_0211600 PFB0515w    UDP-GlcNAc:dolichyl-pyrophosphoryl-GlcNAc GlcNAc transferase                     udp-n-acetylglucosamine transferase subunit alg14, putative                                         "2.4.1.141" NA
 8 PF3D7_0806400 MAL8P1.133  UDP-GlcNAc:dolichyl-pyrophosphoryl-GlcNAc GlcNAc transferase                     glycosyltransferase family 28 protein, putative                                                     "2.4.1.141" NA
 9 PF3D7_1343600 MAL13P1.218 UDP-N-acetylglucosamine pyrophosphorylase                                        UDP-N-acetylglucosamine pyrophosphorylase, putative                                                 "2.7.7.23"  NA
10 PF3D7_0321200 PFC0935c    UDP-N-acetylglucosamine-dolichyl-phosphate N-acetylglucosaminephosphotransferase UDP-N-acetylglucosamine--dolichyl-phosphate n-acetylglucosaminephosphotransferase, putative         "2.7.8.15"  NA

read_html("https://mpmp.huji.ac.il/maps/nitrogenmetpath.html") %>% html_elements("table.table-bordered.table-hover") %>% html_table()
[[1]]
# A tibble: 11 × 6
   PFID          `PFID Old` Annotation                        `Formal Annotation`                    EC         Transcript
   <chr>         <chr>      <chr>                             <chr>                                  <chr>      <lgl>
 1 PF3D7_1308200 PF13_0044  Carbamoyl-phosphate synthase      carbamoyl phosphate synthetase         "6.3.5.5"  NA
 2 PF3D7_0720400 PF07_0085  Ferredoxin-NADP+ reductase        ferrodoxin reductase-like protein      "1.18.1.2" NA
 3 PF3D7_0720400 PF07_0085  ferrodoxin reductase-like protein ferrodoxin reductase-like protein      "1.18.1.2" NA
 4 PF3D7_0802000 PF08_0132  Glutamate dehydrogenase (NAD)     glutamate dehydrogenase, putative      "1.4.1.2"  NA
 5 PF3D7_1416500 PF14_0164  Glutamate dehydrogenase (NADP)    NADP-specific glutamate dehydrogenase  "1.4.1.4"  NA
 6 PF3D7_1430700 PF14_0286  Glutamate dehydrogenase (NADP)    NADP-specific glutamate dehydrogenase  "1.4.1.4"  NA
 7 PF3D7_0922600 PFI1110w   Glutamate-ammonia ligase          glutamine synthetase, putative         "6.3.1.2"  NA
 8 PF3D7_0720400 PF07_0085  Nitrate reductase                 ferrodoxin reductase-like protein      "1.7.1.1"  NA
 9 PF3D7_1367500 PF13_0353  Nitrate reductase                 NADH-cytochrome b5 reductase, putative "1.6.2.2"  NA
10 PF3D7_1434000 PF14_0321  Nitrate transporter               CCR4-associated factor 16, putative    ""         NA
11 PF3D7_0316600 PFC0725c   Nitrite transporter               formate-nitrite transporter, FNT       ""         NA
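
Since you want to re-run this periodically, the selector can be wrapped in a small helper and applied to a vector of pathway URLs. This is a sketch under the assumption that every pathway page marks its gene table with those same two classes; extract_pathway_table() and scrape_pathways() are hypothetical names, not part of rvest.

```r
library(rvest)

# Pull the first table with both classes from an already parsed document.
# Returns NULL when the page has no such table.
extract_pathway_table <- function(doc) {
  tables <- doc %>%
    html_elements("table.table-bordered.table-hover") %>%
    html_table()
  if (length(tables) == 0) return(NULL)
  tables[[1]]
}

# Fetch each URL and extract its table; a page that fails to download
# or parse yields NULL instead of stopping the whole run.
scrape_pathways <- function(urls) {
  out <- lapply(urls, function(u) {
    tryCatch(extract_pathway_table(read_html(u)),
             error = function(e) NULL)
  })
  names(out) <- urls
  out
}

# Usage (needs network access):
# urls <- c("https://mpmp.huji.ac.il/maps/aminosugmetpath.html",
#           "https://mpmp.huji.ac.il/maps/nitrogenmetpath.html")
# pathways <- scrape_pathways(urls)
```

Any NULL entry in the result flags a page that needs a manual look, either because the request failed or because that page uses different markup.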

ADD REPLY • link 16 months ago by Ram 45k

0

I tried doing that. The conda environment contains a version one lower, i.e. rvest v1.0.2, so I downgraded rvest in RStudio with devtools::install_version("rvest", version = "1.0.2"). However, the problem persisted. My rvest works perfectly on other links, but strangely not on links from this database. I made a similar, unusual observation two weeks ago and reported it here.

ADD REPLY • link 16 months ago by rohitsatyam102 ▴ 940

0

Try running curl on the URLs from both environments (system("curl <URL>")) and check if you can access the pages.
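
If you would rather do the same check from inside each R session, something like the sketch below reports the HTTP status and how many table elements the returned HTML contains. It assumes the httr package is installed; check_page() and count_tables() are hypothetical helper names, and the 30 second timeout is an arbitrary choice.

```r
library(rvest)

# Count <table> elements in a string of HTML (0 for an empty body).
count_tables <- function(html_text) {
  if (!nzchar(html_text)) return(0L)
  length(html_elements(read_html(html_text), "table"))
}

# Request a URL and summarise what came back, without throwing on failure.
check_page <- function(url) {
  resp <- tryCatch(httr::GET(url, httr::timeout(30)),
                   error = function(e) e)
  if (inherits(resp, "error")) {
    return(list(ok = FALSE, status = NA_integer_, n_tables = NA_integer_,
                message = conditionMessage(resp)))
  }
  body <- httr::content(resp, as = "text", encoding = "UTF-8")
  list(ok = !httr::http_error(resp),
       status = httr::status_code(resp),
       n_tables = count_tables(body),
       message = "")
}

# Usage (needs network access), run once in each environment:
# str(check_page("https://mpmp.huji.ac.il/maps/VATPASE.html"))
```

A connection or certificate error in message, or a status other than 200, would point to a network-level difference between the environments rather than to rvest itself; a 200 with n_tables equal to 0 would suggest the server returned different HTML.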

ADD REPLY • link 16 months ago by Ram 45k