Dear biostar readers,
why is it so much slower to query an sqlite database using RSQlite «from the outside» using param like
statement <-
"SELECT * FROM gene2refseq R
LEFT JOIN gene_info I ON
I.GeneID = R.GeneID
WHERE R.`RNA_nucleotide_accession.version`
LIKE ?"
db <- "gene_info.sqlite"
conn <- DBI::dbConnect(RSQLite::SQLite(), db)
x1 <- DBI::dbGetQuery(conn=conn,
statement=statement,
param=list(Håkan20210914$RNANucleotideAccession))
compared to querying «from the inside» of sqlite, by writing your search terms as a table first, and then calling it
statement <-
"SELECT * FROM H
LEFT JOIN gene2refseq R ON
R.`RNA_nucleotide_accession.version`
LIKE '%' || H.RNANucleotideAccession || '%'
LEFT JOIN gene_info I ON I.GeneID = R.GeneID"
DBI::dbWriteTable(conn, "H", Håkan20210914)
x2 <- DBI::dbGetQuery(conn=conn, statement=statement)
DBI::dbDisconnect(conn)
On my system (E5-2603 v4), the first query took more than an hour, while the second took only a few minutes ...
Do you guys know of any faster (but also nice) way to dig around in very large tsv files like https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz and https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz ?
This is double-posted from https://stat.ethz.ch/pipermail/r-help/2021-October/472465.html, but I thought some people might know more here compared to r-help ...
Best, Rasmus