Hi Everyone!!
I have 100 COVID databases for which I want to record the date of the last update. Some are updated regularly, but some aren't.
I am trying to do this in R; an example is given below. I want to write a function that I can run on those 100 links weekly to record when each of the 100 databases was last updated.
library(stringr)
library(rvest)
library(lubridate)

# readLines() can read a URL directly
html <- readLines("https://grafnet.kaust.edu.sa/assayM/")

# "update" already matches "last update" and "updated", so one keyword
# is enough; ignore.case = TRUE handles the capitalisation
t <- html[grep(pattern = "update", x = html, ignore.case = TRUE)]

# strip HTML tags, leaving only the text
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

t <- cleanFun(t)
dmy(t)  # fails when extra words surround the date
I can't move beyond this step. I know that if I can reduce the string to just the date, I can use the dmy() function from lubridate, but I am poor at regex. Can someone help?
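One possible way forward (a sketch, assuming the cleaned lines look roughly like the examples below, which are mine and not taken from your pages) is to extract just the date-like substring with stringr::str_extract() and let lubridate::parse_date_time() try several formats:

library(stringr)
library(lubridate)

# hypothetical examples of the text left after tag stripping
x <- c("Last update: 21 March 2021",
       "Updated 2021-03-21",
       "Data updated daily")

# pull out a date-like substring: "21 March 2021" style or ISO
# "2021-03-21" style; NA if neither is present
date_txt <- str_extract(
  x,
  "\\d{1,2}\\s+[A-Za-z]+\\s+\\d{4}|\\d{4}-\\d{2}-\\d{2}"
)

# parse_date_time() tries the given orders in turn, so mixed
# formats still parse; "daily" simply comes back as NA
parse_date_time(date_txt, orders = c("dmy", "ymd"))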
Are you sure scraping the HTML yields the date of the last update for all the datasets listed there? Did you actually take a look at the page source first?

Well, it does for most of them, if not all. I have manually sorted out the ones that don't. The problem is that I don't always get dates; often keywords such as "Daily", "Monthly", or "Weekly" appear instead.
Here are a few of the databases out of those 100:
But for some it doesn't work. For example: http://covdb.popgenetics.net/v3/index/update
What is the expected output?
The expected output is a data frame with the database name (which I have in my Excel sheet), the link (again from my Excel sheet), and a new column with the update date.
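Building on that, here is a minimal sketch of the weekly runner, assuming the Excel sheet has already been read into a data frame dbs with columns name and link (those column names, and the helper get_update_date(), are assumptions for illustration):

library(stringr)
library(lubridate)

# returns the parsed update date for one link, or NA if none is found
get_update_date <- function(link) {
  html <- tryCatch(readLines(link, warn = FALSE),
                   error = function(e) character(0))
  hit <- grep("update", html, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  txt <- gsub("<.*?>", "", hit[1])  # strip HTML tags
  date_txt <- str_extract(
    txt, "\\d{1,2}\\s+[A-Za-z]+\\s+\\d{4}|\\d{4}-\\d{2}-\\d{2}")
  as.character(parse_date_time(date_txt, orders = c("dmy", "ymd")))
}

# one row per database; NA where no date could be extracted
dbs$update_date <- vapply(dbs$link, get_update_date, character(1))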
I am not sure that HTML scraping and text parsing is the best method for this. It obviously works for a lot of cases, but the complexity involved in handling all the possible edge cases seems like a big headache to me. Is there no way to programmatically query the databases directly and check for such things? Or, if these are file downloads instead of APIs, could you just maintain a database of md5sums for the files, then re-download and compare them to determine whether something has changed?

You might consider a two-tiered approach where you use your current HTML scraping/parsing method for the 'easy' ones and take more extreme measures, such as actually downloading and checking the md5, for the harder cases. In general, determining this from the HTML pages alone seems not ideal if you care about accuracy, because a web page could change at any time and break your parsing script; it is better to go directly to the data source if you can.
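For the md5 route, a rough sketch using base R's download.file() and tools::md5sum() (the function name check_changed and its arguments are placeholders, not an existing API):

# compare a fresh download against last week's stored checksum
check_changed <- function(url, old_md5) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  download.file(url, tmp, quiet = TRUE, mode = "wb")
  unname(tools::md5sum(tmp)) != old_md5  # TRUE if the file changed
}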
Most of these servers are Shiny apps/dashboards, and therefore there is no API.