Good afternoon,
I am trying to do a DTU analysis for my research, but I am kinda new to this stuff and I have some problems. In particular on point 5). I am following the workflow of Bioconductor vignette rnaseqDTU
and my pipeline is this: 1) read salmon quants
## List all directories containing data
samples <- list.files(path = "../data/quants", full.names = T)
## Obtain a vector of all filenames including the path
files <- file.path(samples, "quant.sf")
## Since all quant files have the same name it is useful to have names for each element
names(files) <- str_replace(samples, "../data/quants/", "") %>%
str_replace(".salmon", "")
2) Then, create sampleTable from an excel file containing clinical info. It is structured like the following:
1 sex age condition
2 M 20 A
3 M 30 B
3) Tximeta imoprt counts with metadata
# Import counts
txi <- tximport(files, type="salmon", txOut=TRUE,
countsFromAbundance="scaledTPM")
rownames(txi$counts) <- gsub("\\..*", "", rownames(txi$counts))
cts <- txi$counts
cts <- cts[rowSums(cts) > 0,]
head(cts)
4) Trascript to geneID
gtf <- "../data/Homo_sapiens.GRCh38.110 (1).gtf"
txdb.filename <- "../data/Homo_sapiens.GRCh38.110.sqlite"
txdb <- makeTxDbFromGFF(gtf)
saveDb(txdb, txdb.filename)
# Load and add geneid
txdb <- loadDb(txdb.filename)
txdf <- select(txdb, keys(txdb, "GENEID"), "TXNAME", "GENEID")
tab <- table(txdf$GENEID)
txdf$ntx <- tab[match(txdf$GENEID, names(tab))]
head(txdf)
5) Check consistency --> ERROR
all(rownames(cts) %in% txdf$TXNAME) # FALSE
My cts
object is this way and has length equal to 21436, instead the txdf$TXNAME
has length 252894
Sample1 Sample2 Sample3
ENST00000000003
ENST00000000004
ENST00000000005
A more detailed image of the cts object: ![cts][2]
Probably I am missing something and I cannot understand what.
Thank you, you were right. I updated my code but the all() keeps returning FALSE. I'll update the question above
What transcript file did you use for the salmon quant? To maintain consistency with downstream analysis I'll usually use gffread to generate the transcript file using the same GTF (and the associated genome assembly) I use later.
I used the following:
Homo_sapiens.GRCh38.110
from ensembl (I used the same in my previous DGE analysis).The link is this one ensembl gtf's