I need to create several transcript database objects from different GTF files, so I would like to save time by running each txdb creation in parallel.
However, the txdb generated in parallel (although the run completes without errors) doesn't behave like the one generated by a single call to the function. At first I thought it might be related to the way I wrapped it in a function, but that does not seem to be the problem.
I do not understand why in the minimal example below, "txdb" and "txdb2" are valid and "txdb3" is not. Anyone have any ideas?
> require(GenomicFeatures)
>
> txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf",format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: closing unused connection 4 (<-mordor-PC:11950)
2: closing unused connection 3 (<-mordor-PC:11950)
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
7: Named parameters not used in query: gene_id, internal_tx_id
>
> test <- function(gtffile="gtf_files/exons_final_sorted.gtf", ftype="gtf"){
+ nuTxDb <- makeTxDbFromGFF(file=gtffile, format = ftype)
+ return(nuTxDb)
+ }
>
> txdb2 <- test(gtffile="gtf_files/exons_final_sorted.gtf", ftype="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
2: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
3: Named parameters not used in query: internal_id, name, chrom, strand, start, end
4: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
5: Named parameters not used in query: gene_id, internal_tx_id
>
>
> cluster <- makeCluster(2)
> dbeez <- clusterMap(cluster, makeTxDbFromGFF, file = c("gtf_files/exons_final_sorted.gtf", "gtf_files/exons_final_sorted.gtf"), format=c("gtf","gtf"))
> txdb3 <- dbeez[[1]]
>
> typeof(txdb)
[1] "S4"
> txdb
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf_files/exons_final_sorted.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 162738
# exon_nrow: 547144
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-30 15:20:30 -0500 (Sat, 30 Dec 2017)
# GenomicFeatures version at creation time: 1.26.4
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
>
> typeof(txdb2)
[1] "S4"
> txdb2
TxDb object:
# Db type: TxDb
# Supporting package: GenomicFeatures
# Data source: gtf_files/exons_final_sorted.gtf
# Organism: NA
# Taxonomy ID: NA
# miRBase build ID: NA
# Genome: NA
# transcript_nrow: 162738
# exon_nrow: 547144
# cds_nrow: 0
# Db created by: GenomicFeatures package from Bioconductor
# Creation time: 2017-12-30 15:21:10 -0500 (Sat, 30 Dec 2017)
# GenomicFeatures version at creation time: 1.26.4
# RSQLite version at creation time: 1.1-2
# DBSCHEMAVERSION: 1.1
>
> typeof(txdb3)
[1] "S4"
> txdb3
TxDb object:
Error in rsqlite_send_query(conn@ptr, statement) :
external pointer is not valid
>
Also, it seems that I cannot use a previously created and seemingly good txdb object in the clusterMap function, even though the exact same object and parameters work outside of it:
> require(GenomicFeatures)
>
>
> txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf",format="gtf")
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Warning messages:
1: closing unused connection 4 (<-mordor-PC:11299)
2: closing unused connection 3 (<-mordor-PC:11299)
3: Named parameters not used in query: internal_chrom_id, chrom, length, is_circular
4: Named parameters not used in query: internal_id, name, type, chrom, strand, start, end
5: Named parameters not used in query: internal_id, name, chrom, strand, start, end
6: Named parameters not used in query: internal_tx_id, exon_rank, internal_exon_id, internal_cds_id
7: Named parameters not used in query: gene_id, internal_tx_id
>
> list.per.geneA <- transcriptsBy(x=txdb, by="exon", use.names = TRUE)
Warning message:
In .set_group_names(grl, use.names, txdb, by) :
some group names are NAs or duplicated
> list.per.geneB <- transcriptsBy(x=txdb, by="gene", use.names = FALSE)
> require(parallel)
> cluster <- makeCluster(2)
> list.per.gene <- clusterMap(cluster, transcriptsBy, x=c(txdb, txdb), by=c("exon", "gene"), use.names=c(TRUE,FALSE))
Error in checkForRemoteErrors(val) :
2 nodes produced errors; first error: invalid DB file
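The "invalid DB file" error is consistent with the TxDb holding a live SQLite connection that does not survive being serialized to the worker processes. A minimal sketch of one possible way around it, assuming the workers run on the same machine and can read the same file: write the database to disk with saveDb() and have each worker reopen it with loadDb(). The helper name by_from_file and the temporary .sqlite location are hypothetical, and this has not been run against the files in this thread.

```r
library(parallel)
library(GenomicFeatures)  # attaches AnnotationDbi, which provides saveDb()/loadDb()

txdb <- makeTxDbFromGFF("gtf_files/exons_final_sorted.gtf", format = "gtf")

# Hypothetical location for the on-disk copy of the database.
db_path <- tempfile(fileext = ".sqlite")
saveDb(txdb, file = db_path)

# Hypothetical helper: each worker opens its own connection to the file,
# so no database handle has to cross a process boundary.
by_from_file <- function(path, by, use_names) {
  db <- AnnotationDbi::loadDb(path)
  GenomicFeatures::transcriptsBy(db, by = by, use.names = use_names)
}

cluster <- makeCluster(2)
list.per.gene <- clusterMap(cluster, by_from_file,
                            by = c("exon", "gene"),
                            use_names = c(TRUE, FALSE),
                            MoreArgs = list(path = db_path))
stopCluster(cluster)
```

Only the file path is sent to the workers, and the returned GRangesList objects are ordinary R data, so they should come back to the main session intact.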
Most parallelization in R works either via fork() or by serializing objects to separate worker processes. I suspect that what you're seeing is that the database handles become invalid once they cross a process boundary.
Apologies if this is a naive question, but is that something I can do something about?
No, that's not something you can do anything about. I suspect you simply can't use clusterMap() for this.
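A possible workaround that the thread does not mention, sketched under assumptions: rather than returning the TxDb object itself from clusterMap(), each worker could save its database to a file with saveDb() and return only the path, and the main session could then reopen each file with loadDb(). The helper name build_and_save and the temporary output paths are hypothetical, the function calls assume the GenomicFeatures version shown in the thread (1.26.4), and the code has not been tested on these GTF files.

```r
library(parallel)
library(GenomicFeatures)  # attaches AnnotationDbi, which provides saveDb()/loadDb()

# Hypothetical helper: build the TxDb on a worker, write it to disk,
# and return the file path instead of the object holding the connection.
build_and_save <- function(gtffile, outfile, ftype = "gtf") {
  txdb <- GenomicFeatures::makeTxDbFromGFF(file = gtffile, format = ftype)
  AnnotationDbi::saveDb(txdb, file = outfile)
  outfile
}

# Same input twice, as in the example above; output paths are hypothetical.
gtf_files <- c("gtf_files/exons_final_sorted.gtf",
               "gtf_files/exons_final_sorted.gtf")
sqlite_files <- tempfile(pattern = rep("txdb", length(gtf_files)),
                         fileext = ".sqlite")

cluster <- makeCluster(2)
paths <- clusterMap(cluster, build_and_save,
                    gtffile = gtf_files, outfile = sqlite_files)
stopCluster(cluster)

# Reopen each database in the main session, creating fresh connections here.
txdbs <- lapply(unlist(paths), AnnotationDbi::loadDb)
```

Whether this saves time overall depends on how long the extra write and reload take relative to building each TxDb.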