Hi Everyone!!
I have 100 COVID databases for which I want to record the date of the last update. Some are updated regularly, but some aren't.
I am trying to do this in R; an example is given below. I want to write a function that I can run on those 100 links weekly to record when each of the 100 databases was last updated.
library(stringr)
library(rvest)
library(lubridate)

# readLines() can read a URL directly
html <- readLines("https://grafnet.kaust.edu.sa/assayM/")

# "update" already matches "last update" and "updated", so one keyword
# is enough; ignore.case = TRUE handles the capitalisation
t <- html[grep(pattern = "update", x = html, ignore.case = TRUE)]

# strip HTML tags, leaving only the text
cleanFun <- function(htmlString) {
  return(gsub("<.*?>", "", htmlString))
}

t <- cleanFun(t)
dmy(t)  # fails when extra words surround the date
I can't move beyond this step. I know that if I can reduce the string to just the date, I can use the dmy() function from lubridate, but I am poor at regex. Can someone help?
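One possible way forward (a sketch, assuming the cleaned lines look roughly like the examples below, which are mine and not taken from your pages) is to extract just the date-like substring with stringr::str_extract() and let lubridate::parse_date_time() try several formats:

library(stringr)
library(lubridate)

# hypothetical examples of the text left after tag stripping
x <- c("Last update: 21 March 2021",
       "Updated 2021-03-21",
       "Data updated daily")

# pull out a date-like substring: "21 March 2021" style or ISO
# "2021-03-21" style; NA if neither is present
date_txt <- str_extract(
  x,
  "\\d{1,2}\\s+[A-Za-z]+\\s+\\d{4}|\\d{4}-\\d{2}-\\d{2}"
)

# parse_date_time() tries the given orders in turn, so mixed
# formats still parse; "daily" simply comes back as NA
parse_date_time(date_txt, orders = c("dmy", "ymd"))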
Are you sure scraping the HTML yields the date of the last update for all the datasets listed there? Did you actually take a look at the page source first?

Well, it does for most of them, if not all. I have manually sorted out the ones that don't. The problem is that I don't always get dates; often keywords such as "Daily", "Monthly", or "Weekly" appear instead.
Here are a few of the databases out of those 100:
But for some it doesn't work. For example: http://covdb.popgenetics.net/v3/index/update
What is the expected output?
The expected output is a data frame with the database name (which I have in my Excel sheet), the link (again from my Excel sheet), and a new column with the update date.
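Building on that, here is a minimal sketch of the weekly runner, assuming the Excel sheet has already been read into a data frame dbs with columns name and link (those column names, and the helper get_update_date(), are assumptions for illustration):

library(stringr)
library(lubridate)

# returns the parsed update date for one link, or NA if none is found
get_update_date <- function(link) {
  html <- tryCatch(readLines(link, warn = FALSE),
                   error = function(e) character(0))
  hit <- grep("update", html, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  txt <- gsub("<.*?>", "", hit[1])  # strip HTML tags
  date_txt <- str_extract(
    txt, "\\d{1,2}\\s+[A-Za-z]+\\s+\\d{4}|\\d{4}-\\d{2}-\\d{2}")
  as.character(parse_date_time(date_txt, orders = c("dmy", "ymd")))
}

# one row per database; NA where no date could be extracted
dbs$update_date <- vapply(dbs$link, get_update_date, character(1))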
I am not sure that HTML scraping and text parsing is the best method for this. It obviously works for a lot of cases, but the complexity involved in handling all the possible edge cases seems like a big headache to me. Is there no way to programmatically query the databases directly and check for such things? Or, if these are file downloads instead of APIs, could you just maintain a database of md5sums for the files, then re-download and compare them to determine whether something has changed?

You might consider a two-tiered approach where you use your current HTML scraping/parsing method for the 'easy' ones and take more extreme measures, such as actually downloading and checking the md5, for the harder cases. In general, determining this from the HTML pages alone seems not ideal if you care about accuracy, because a web page could change at any time and break your parsing script; it is better to go directly to the data source if you can.
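For the md5 route, a rough sketch using base R's download.file() and tools::md5sum() (the function name check_changed and its arguments are placeholders, not an existing API):

# compare a fresh download against last week's stored checksum
check_changed <- function(url, old_md5) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  download.file(url, tmp, quiet = TRUE, mode = "wb")
  unname(tools::md5sum(tmp)) != old_md5  # TRUE if the file changed
}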
Most of these servers are Shiny apps/dashboards, and therefore there is no API.