How to scrape the PlasmoDB website

Hi everyone

I was wondering if there is a way to scrape the PlasmoDB website using rvest. I have a few links to gene pages (for example: https://plasmodb.org/plasmo/app/record/gene/PF3D7_0418300) from which I wish to scrape all the tables in each section (Genomic Location, Literature, etc.), but as you can see, there are drop-down arrows that display the tables only when clicked, rather than static tables.

I am confused about how to use rvest for this.
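
For reference, the static approach I had in mind is something like the sketch below, though I suspect it only sees the initial HTML and none of the tables behind the drop-down arrows:

library(rvest)

## Naive static scrape: read the gene page once and pull whatever <table>
## elements exist in the served HTML (the click-to-expand tables are
## rendered client-side, so they will be missing)
page <- read_html("https://plasmodb.org/plasmo/app/record/gene/PF3D7_0418300")
tables <- html_table(page)
length(tables)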

Sincerely

plasmoDB rvest tidyr

Looks like there is programmatic access via an API (https://plasmodb.org/plasmo/app/static-content/content/PlasmoDB/webServices.html ) and data downloads for tables (https://plasmodb.org/plasmo/app/downloads ), so why do you need to scrape the content?
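
For example, a rough sketch of calling a JSON endpoint from R with httr and jsonlite (the URL and query parameters below are placeholders, not a verified endpoint; take the real ones from the webServices page linked above):

library(httr)
library(jsonlite)

## Placeholder endpoint: substitute a real one from the webServices documentation
resp <- GET("https://plasmodb.org/plasmo/service/record-types/gene",
            query = list(format = "json"))
stop_for_status(resp)

## Parse the JSON body into R lists / data frames
dat <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(dat, max.level = 1)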


Yes, I am trying to access it via R. I am aware that I can download tables from PlasmoDB. I am not really well versed with REST API services, and the output was in JSON format, so I thought there might be a way to scrape the webpage itself. Besides, the downloads page does not let me download all the tables that are otherwise present on the page I linked above.
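
If jsonlite can flatten those JSON responses into data frames, something like this sketch is roughly what I would try (untested):

library(jsonlite)

## Rough idea: fromJSON() often flattens JSON records into data frames;
## nested pieces remain as list columns
dat <- fromJSON("response.json", flatten = TRUE)
str(dat, max.level = 1)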


Python has several libraries for scraping that you can use. I can help you if you want. I am unsure if proposing collaboration in this forum is allowed. If not, I will edit my post.


You can post a solution if you wish to. It is your responsibility to make sure that the solution does not violate any restrictions posted on the source website/database.


Okay, so I will answer my own question. I tried RSelenium because PlasmoDB renders its pages dynamically with JavaScript, so the elements I was interested in (the data tables) load only when clicked. Since rvest can only handle static HTML pages, this has to be done with RSelenium. After spending a day installing it on Windows 11, I was able to write the following code, which works for me (though it may not be the most elegant solution):

library(RSelenium)
library(netstat)
library(wdman)
library(rvest)

## Before installing the Selenium drivers, first download OpenJDK from Azul (https://www.azul.com/downloads/#zulu) and install it in the default directory it chooses

## Installation
## First run selenium() to install the necessary drivers
# selenium()
#
# ## Now check where these drivers have been installed
# obj <- selenium(retcommand = TRUE, check = FALSE)

## You will now see three directories containing drivers of version < 115. For an
## up-to-date Chrome we first download a new driver manually from
## https://googlechromelabs.github.io/chrome-for-testing/#stable (you can check which
## Chrome version you are running by typing chrome://version/ in the browser). This is
## the most crucial step. After downloading, create a folder named after the numeric
## version (say 126.0.6478.126 here) inside
## Users\rohit_satyam\AppData\Local\binman\binman_chromedriver\win32\ and place the
## driver there. Remember that AppData is a hidden folder and requires enabling
## View > Show > Hidden Items in Windows 11.
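
## Optional sanity check (assumption: the binman package pulled in by the
## RSelenium install is available): confirm the manually created version
## folder is picked up, then pass that exact version to rsDriver()
# binman::list_versions("chromedriver")
# rD <- rsDriver(browser = "chrome", chromever = "126.0.6478.126", port = free_port())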


getTables <- function(gids){
  urls <- paste0("https://plasmodb.org/plasmo/app/record/gene/", gids, "#ExpressionGraphs")
  rD1 <- rsDriver(browser = "chrome", port = free_port(), chromever = "latest")
  remdr <- rD1$client

  results <- lapply(seq_along(urls), function(i){
    remdr$navigate(urls[i]) ## Go to the page
    Sys.sleep(5)
    ## Now click all the buttons to open the transcriptomics field
    firstTask <- remdr$findElements(using = "xpath","//th[contains(@class,'wdk-DataTableCell')]")
    lapply(firstTask, function(x){x$clickElement()})

    Sys.sleep(5)
    ## Now within each transcriptomics study open the data table
    secondTask <- remdr$findElements(using = "xpath","//h4[contains(@class,'eupathdb-ExpressionGraphsDataTableContainerHeader')]")
    lapply(secondTask, function(x){x$clickElement()})


    ## Store the names of the studies
    thirdTask <- remdr$findElements(using = "xpath", "//div[contains(@class,'Cell HtmlCell Cell-short_attribution')]")
    studyname <- as.character(lapply(thirdTask, function(x){x$getElementText()[[1]]}))

    ## Also get the study descriptions, because the study name is not available for all data tables
    fourthTask <- remdr$findElements(using = "xpath", "//div[contains(@class,'Cell HtmlCell Cell-summary')]")
    description <- as.character(lapply(fourthTask, function(x){x$getElementText()[[1]]}))

    ## Drop Caro et al. since it has no data table; guard the subsetting so
    ## an empty match does not wipe out the description vector
    remdesc <- which(studyname == "Caro et al.")
    if (length(remdesc) > 0) {
      studyname <- studyname[-remdesc]
      description <- description[-remdesc]
    }
    Sys.sleep(10)
    ## Now grab all the data tables from the rendered page
    fifthTask <- remdr$findElements(using = "xpath", "//div[starts-with(@class,'DataTable')]")
    dt <- fifthTask[[1]]$getPageSource() ## webElement inherits getPageSource() from remoteDriver
    page <- read_html(dt %>% unlist())
    df <- html_table(page)
    df <- df[2:length(df)] ## The first data table is always a superset of the others in our case

    ## Attach the study name and description to each table
    modifydf <- lapply(seq_along(df), function(x){
      df[[x]]$studyname <- studyname[x]
      df[[x]]$description <- description[x]
      df[[x]]
    })

    do.call("rbind", modifydf)
  })

  ## Clean up: close the browser and stop the Selenium server
  remdr$close()
  rD1$server$stop()
  return(results)
}

## Testing with some random genes

res <- getTables(c("PF3D7_0518900","PF3D7_0602800","PF3D7_0624600"))

I wanted this solution because PlasmoDB doesn't provide a way to download these data tables from multiple expression studies at once. My lab members sometimes ask me to fetch these values, and I have to copy-paste them by hand; when the number of genes is high, this becomes a frustrating job.
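
In case it helps, a small follow-up sketch to save the results (this assumes res keeps the order of the input gene IDs, as in the call above):

gids <- c("PF3D7_0518900","PF3D7_0602800","PF3D7_0624600")

## Write one CSV per gene; res[[i]] is the combined data frame for gids[i]
invisible(lapply(seq_along(res), function(i){
  write.csv(res[[i]], paste0(gids[i], "_expression_tables.csv"), row.names = FALSE)
}))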


My sincere gratitude to Samer Hijjazi for making a YouTube tutorial on how to use RSelenium; otherwise the documentation is very limited.
