I collected some rsem isoforms.results files and created a data frame with the transcript ID, the isoform percentage, and an experimental condition (mutant or wildtype)
> head(one.var, n=8)
transcript_id IsoPct GENO.COHORT
3246628 ENSMUST00000000028 62.32 MUTANT
3381904 ENSMUST00000000028 61.87 MUTANT
3517180 ENSMUST00000000028 46.42 WILDTYPE
3652456 ENSMUST00000000028 53.99 WILDTYPE
3787732 ENSMUST00000000028 77.04 MUTANT
3923008 ENSMUST00000000028 37.67 MUTANT
4058284 ENSMUST00000000028 88.63 WILDTYPE
4193560 ENSMUST00000000028 69.87 WILDTYPE
> dput(head(one.var, n=8))
structure(list(transcript_id = c("ENSMUST00000000028", "ENSMUST00000000028",
"ENSMUST00000000028", "ENSMUST00000000028", "ENSMUST00000000028",
"ENSMUST00000000028", "ENSMUST00000000028", "ENSMUST00000000028"
), IsoPct = c(62.32, 61.87, 46.42, 53.99, 77.04, 37.67, 88.63,
69.87), GENO.COHORT = c("MUTANT", "MUTANT", "WILDTYPE", "WILDTYPE",
"MUTANT", "MUTANT", "WILDTYPE", "WILDTYPE")), row.names = c(3246628L,
3381904L, 3517180L, 3652456L, 3787732L, 3923008L, 4058284L, 4193560L
), class = "data.frame")
Now I'd like to see what genes have altered transcript proportions in the mutant condition. So I thought I'd start by looking for changes in the percent abundance for individual transcripts, and then merge the gene infraction back in afterwards.
transcript.list <- sort(unique(as.character(one.var$transcript_id)))
last.transcript <- tail(transcript.list, n = 1)
test.res.frame <-
lapply(transcript.list, function(one.transcript) {
temp <- one.var[one.var$transcript_id == one.transcript ,]
print(paste0(one.transcript, " ", last.transcript))
if (length(unique(temp$IsoPct)) > 1) {
test.res <- wilcox.test(IsoPct ~ GENO.COHORT, temp)
return(list(transcript_id = one.transcript, p.value = test.res$p.value))
} else {
return(list(transcript_id = one.transcript, p.value = NA))
}
})
test.res.frame <- do.call(rbind.data.frame, test.res.frame)
test.res.frame$p.adj <- p.adjust(test.res.frame$p.value)
I think this is going to take ~ 40 minutes to loop through 135k transcripts on a single core, and I have other datsets that I want to process in a similar fashion.
Is there some existing base R or Bioconductor function that can do a two-way test like this on all transcripts in bulk? Should I just use something like mclapply?
So far I have only tried RATS, but it's clearly more sophisticated than anything I could have thrown together. It's also faster than what I described above, but it still takes several minutes on a modern Windows laptop. I haven't looked into multi-threading for tximport or RATS yet. I also need to do more reading about the consequences of different kinds of normalization during tximport.
I found browseVignettes("rats") to be the best way to study RATS
I'm looking forward to trying isoformSwitchAnalyzeR since it offers Functional Consequences analysis.