Parsing Dates from Multiple Websites
Asked 3.6 years ago

Hi everyone!

I have 100 COVID-19 databases from which I want to record the date of the last update. Some are regularly updated, but some aren't.

I am trying to do this in R; an example is given below. I want to write a function that I can run on those 100 links weekly to record when each of the databases was last updated.


library(stringr)
library(rvest)
library(lubridate)

# read the raw HTML of the page, one line per element
html <- readLines("https://grafnet.kaust.edu.sa/assayM/")

# keep the lines that mention an update; note that the inline "(?i)" flag
# is not valid in base R's default regex engine, and ignore.case = TRUE
# already makes the match case-insensitive
t <- html[grep(pattern = "last update|update|updated", x = html, ignore.case = TRUE)]

# strip HTML tags
cleanFun <- function(htmlString) {
  gsub("<.*?>", "", htmlString)
}
t <- cleanFun(t)

dmy(t)  # fails: the string still contains text around the date

I can't move beyond this step. I know that once the string is reduced to just the date I can use the dmy() function from lubridate, but I am poor at regex. Can someone help?
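
For illustration, a minimal sketch of the kind of regex that could isolate the date once the tags are stripped, assuming the cleaned string contains a date written as day month year (the example string below is made up, not taken from any of the sites):

library(stringr)
library(lubridate)

# made-up example of what the string might look like after tag stripping
t <- "Last Updated 23 December 2020 based on public data"

# grab the first "day month year"-looking substring, then parse it
date_txt <- str_extract(t, "\\d{1,2}[ -][A-Za-z]+[ -]\\d{4}")
dmy(date_txt)
[1] "2020-12-23"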

R • 1.7k views
Comment:

Are you sure scraping the HTML yields the date of the last update for all the datasets listed there? Did you actually take a look at the page source first?

Reply:

Well, it does for most of them, if not all, and I have manually sorted out the ones where it doesn't. The problem is that I don't always get dates; often keywords such as "Daily", "Monthly", or "Weekly" are included instead.

Here are a few of the databases out of those 100:

  1. https://91-divoc.com/pages/covid-visualization/
  2. https://covidibd.org/current-data/
  3. https://cov3d.ibbr.umd.edu/

But for some it doesn't work, for example: http://covdb.popgenetics.net/v3/index/update

Comment:

What is the expected output?

Reply:

The expected output is a data frame with the database name (which I have in my Excel sheet), the link (also in my Excel sheet), and a new column with the update date:

assayM https://grafnet.kaust.edu.sa/assayM/ 23-Dec-2020
BioGRID COVID-19 Coronavirus Curation Project https://thebiogrid.org/project/3/covid-19-coronavirus.html 30-Mar-2021
Contact-guided Iterative Threading ASSEmbly Refinement (C-I-TASSER) https://zhanglab.ccmb.med.umich.edu/COVID-19/ 15-Jan-2021
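
A minimal sketch of that workflow, assuming the sheet has columns named name and link and that get_update_date(name, url) is the per-site extractor the thread is after (the file name, column names, and helper are all hypothetical):

library(readxl)

# hypothetical sheet with one row per database: columns "name" and "link"
dbs <- read_excel("covid_databases.xlsx")

# get_update_date() is the hypothetical per-site date extractor;
# format the parsed Date as e.g. "23-Dec-2020"
dbs$updated <- mapply(function(n, u) format(get_update_date(n, u), "%d-%b-%Y"),
                      dbs$name, dbs$link)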
Comment:

I am not sure that HTML scraping and text parsing is the best method for this. It obviously seems to work for a lot of cases, but the complexity involved in handling all the possible edge cases seems like a big headache to me. Is there no way to programmatically query the databases directly and check for such things? Or, if these are file downloads rather than APIs, could you just maintain a database of md5sums for the files, then re-download and compare them to determine whether something has changed? You might consider a two-tiered approach: use your current HTML scraping/parsing method for the 'easy' ones, and take more extreme measures, such as actually downloading and checking the md5, for the harder cases. In general, determining this from the HTML pages alone seems not ideal if you care about accuracy, because a page could change at any time and break your parsing script; better to go directly to the data source if you can.
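
A minimal sketch of the md5 idea, assuming the source is a plain file download; has_changed() is a hypothetical helper, and old_md5 would come from a previous run:

library(tools)

# hypothetical helper: TRUE if the file behind `url` no longer matches
# the md5 recorded on a previous run
has_changed <- function(url, old_md5) {
  tmp <- tempfile()
  download.file(url, tmp, quiet = TRUE, mode = "wb")
  new_md5 <- unname(md5sum(tmp))  # md5 of the freshly downloaded copy
  unlink(tmp)
  !identical(new_md5, old_md5)
}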

Reply:

Most of these servers are Shiny apps/dashboards, and therefore there is no API.

Answer (3.6 years ago):

> library(stringr)    # str_split()
> library(lubridate)  # dmy()
> library(magrittr)   # %>%
> src <- "https://grafnet.kaust.edu.sa/assayM/"
> html <- readLines(src)
> # keep the line(s) that mention an update
> t <- html[grep(pattern = "last update|update|updated", x = html, ignore.case = TRUE)]
> # split the matched text around the date and parse the second piece
> cleanFun <- function(x) {
+     str_split(x, "Updated | based|update")[[1]][2] %>%
+         dmy()
+ }
> data.frame(source = src, date = cleanFun(t))
                                source       date
1 https://grafnet.kaust.edu.sa/assayM/ 2020-12-23

But this function is not going to work for every site; you may have to write a better extraction function for each website. By the way, where are you getting the "assayM" name from?
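
One way the "separate function per website" idea could be organized is a registry of extractors keyed by database name. This is a hypothetical sketch, with the assayM rule taken from the answer above:

library(stringr)
library(lubridate)

# hypothetical registry: one date-extraction rule per site
extractors <- list(
  assayM = function(line) dmy(str_split(line, "Updated | based|update")[[1]][2])
  # ... one entry per database that needs its own rule
)

# pick the matching line from the page, then apply the site's extractor
get_update_date <- function(name, url) {
  html <- readLines(url)
  line <- html[grep("update", html, ignore.case = TRUE)][1]
  extractors[[name]](line)
}

get_update_date("assayM", "https://grafnet.kaust.edu.sa/assayM/")
[1] "2020-12-23"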

Reply:

Well, there are several ways to skin a cat in R. Here is the code I would prefer for scraping the same information:

library(stringr)
library(rvest)
library(lubridate)
library(dplyr)
library(purrr)

src <- "https://grafnet.kaust.edu.sa/assayM/"
html <- read_html(src)

html %>% 
    html_elements("h5") %>%   # the update notice sits in an <h5> tag
    html_text2() %>%          # visible text only, tags stripped
    tibble() %>% 
    slice(1L) %>%             # keep the first <h5> on the page
    str_split("Updated | based") %>% 
    pluck(1, 2) %>%           # the second piece holds the date text
    dmy()

[1] "2020-12-23"