Okay, so I will answer my own question. I tried RSelenium because PlasmoDB renders its pages with JavaScript, so elements such as the data tables I was interested in load only when clicked. Since rvest only parses static HTML, the clicking has to be done with a browser-automation tool such as RSelenium. After spending a day installing it on Windows 11, I wrote the following code, which works for me (though it may not be the most elegant solution):
library(RSelenium)
library(netstat)
library(wdman)
library(rvest)
## Before installing the Selenium drivers, download OpenJDK from Azul (https://www.azul.com/downloads/#zulu) and install it in the default directory the installer suggests.
## Installation
## First run the selenium() command once to install the necessary drivers
# selenium()
#
# ## Now check where these drivers have been installed
# obj <- selenium(retcommand = TRUE, check = FALSE)
## You will now see 3 directories containing drivers of version < 115. For an up-to-date Chrome, first download a matching driver manually from https://googlechromelabs.github.io/chrome-for-testing/#stable (you can check which Chrome version you are running by typing chrome://version/ in the browser). This is the most crucial step.
## After downloading, create a folder named after the version number, say 126.0.6478.126, inside Users\rohit_satyam\AppData\Local\binman\binman_chromedriver\win32\ to hold the driver.
## Remember that AppData is a hidden folder, so you will need to click View > Show > Hidden items in Windows 11.
getTables <- function(gids){
  urls <- paste0("https://plasmodb.org/plasmo/app/record/gene/", gids, "#ExpressionGraphs")
  rD1 <- rsDriver(browser = "chrome", port = free_port(), chromever = "latest")
  remdr <- rD1$client
  ## Close the browser and stop the Selenium server when the function exits, even on error
  on.exit({remdr$close(); rD1$server$stop()}, add = TRUE)
  results <- lapply(seq_along(urls), function(i){
    remdr$navigate(urls[i]) ## Go to the page
    Sys.sleep(5) ## Give the page time to render
    ## Now click all the buttons to open the transcriptomics field
    firstTask <- remdr$findElements(using = "xpath", "//th[contains(@class,'wdk-DataTableCell')]")
    lapply(firstTask, function(x){x$clickElement()})
    Sys.sleep(5)
    ## Now within each transcriptomics study open the data table
    secondTask <- remdr$findElements(using = "xpath", "//h4[contains(@class,'eupathdb-ExpressionGraphsDataTableContainerHeader')]")
    lapply(secondTask, function(x){x$clickElement()})
    ## Let's store the names of the studies
    thirdTask <- remdr$findElements(using = "xpath", "//div[contains(@class,'Cell HtmlCell Cell-short_attribution')]")
    studyname <- as.character(lapply(thirdTask, function(x){x$getElementText()[[1]]}))
    ## Let's get the study description because the study name is not available for all data tables
    fourthTask <- remdr$findElements(using = "xpath", "//div[contains(@class,'Cell HtmlCell Cell-summary')]")
    description <- as.character(lapply(fourthTask, function(x){x$getElementText()[[1]]}))
    ## Remove Caro et al. because there is no data table for it.
    ## A logical mask is used because description[-which(...)] returns an empty vector when the study is absent.
    keep <- studyname != "Caro et al."
    studyname <- studyname[keep]
    description <- description[keep]
    Sys.sleep(10)
    ## Get all the data tables now (getPageSource() returns the source of the whole page)
    fifthTask <- remdr$findElements(using = "xpath", "//div[starts-with(@class,'DataTable')]")
    dt <- fifthTask[[1]]$getPageSource()
    page <- read_html(dt %>% unlist())
    df <- html_table(page)
    df <- df[-1] ## Since the first data table is always a superset of all the DTs in our case
    ## Add the information of study and the description
    modifydf <- lapply(seq_along(df), function(x){
      df[[x]]$studyname <- studyname[x]
      df[[x]]$description <- description[x]
      df[[x]]
    })
    ## One combined data frame per gene
    do.call("rbind", modifydf)
  })
  return(results)
}
## Testing with some random genes
res <- getTables(c("PF3D7_0518900","PF3D7_0602800","PF3D7_0624600"))
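The list returned by getTables() has one element per gene, in the order the IDs were passed in. Below is a minimal sketch of stacking those elements into a single table with a gene column. It uses mock data frames in place of the scraped ones; the `Sample` and `TPM` column names and the values are made up for illustration, and the real columns depend on what PlasmoDB shows for each study.

```r
## Mock output standing in for res <- getTables(gids); columns and values are hypothetical
gids <- c("PF3D7_0518900", "PF3D7_0602800")
res <- list(
  data.frame(Sample = c("ring", "troph"), TPM = c(1.5, 2.0), studyname = "Study A"),
  data.frame(Sample = "schizont", TPM = 3.2, studyname = "Study B")
)

## Label each element with its gene ID, then add that ID as a column
names(res) <- gids
tagged <- Map(function(df, gid) { df$gene <- gid; df }, res, names(res))

## Stack everything into one data frame and write it out for sharing
combined <- do.call(rbind, unname(tagged))
rownames(combined) <- NULL
write.csv(combined, file.path(tempdir(), "expression_tables.csv"), row.names = FALSE)
```

Note that rbind() only works when every table has the same columns; if the scraped tables differ between genes, they would need to be harmonised first.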
I wanted this solution because PlasmoDB doesn't provide a way to download these data tables from multiple expression studies. My lab members sometimes ask me for these values, and I have to copy and paste them by hand. When the number of genes is large, this becomes a frustrating job.
Looks like there is programmatic access via an API (https://plasmodb.org/plasmo/app/static-content/content/PlasmoDB/webServices.html) and data downloads for tables (https://plasmodb.org/plasmo/app/downloads), so why do you need to scrape the content?
Yes, I am trying to access it via R. I am aware that I can download tables from PlasmoDB. I am not well versed in REST API services, and the output was in JSON format, so I wondered whether there was a way to scrape the webpage itself. Besides, the downloads page does not let me download all the tables that are otherwise present on the page I linked above.
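On the JSON point: in R, jsonlite::fromJSON() turns an array of flat records directly into a data frame, so JSON output need not be an obstacle. Here is a minimal sketch, assuming the jsonlite package is installed. The payload is invented for illustration and does not reflect the actual PlasmoDB web-service response, whose structure would have to be checked against the API documentation.

```r
library(jsonlite)

## Hypothetical JSON payload; NOT the real PlasmoDB response format
payload <- '[
  {"gene": "PF3D7_0518900", "sample": "ring", "value": 1.5},
  {"gene": "PF3D7_0518900", "sample": "trophozoite", "value": 2.0}
]'

## An array of flat objects is simplified to a data frame by default
expr <- fromJSON(payload)
str(expr)
```

fromJSON() also accepts a URL or a file path in place of the string, so the same call could read a web-service response once the right endpoint is known.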
Python has several libraries for scraping that you can use. I can help you if you want. I am unsure if proposing collaboration in this forum is allowed. If not, I will edit my post.
You can post a solution if you wish to. It is your responsibility to make sure that the solution does not violate any restrictions posted on the source website/database.