Question

Using NumReads column of quant.genes.sf file to build DESeq2 input

8 weeks ago

Ezequiel ▴ 10

I have a question I've been meaning to ask for a while. To run DESeq2 I need to provide a table of un-normalized counts as input. For that, I use tximport, which takes the quant.sf files that salmon produced for each of my samples and, together with a transcript-to-gene reference, generates the input needed to run DESeq2.

I ran:

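library(tximport)
library(stringr)  # str_remove(); also provides %>%
# One salmon output directory per sample, matched by accession prefix
samples <- list.files(path = ".", full.names = TRUE, pattern = "SRR")
files <- file.path(samples, "quant.sf")
names(files) <- str_remove(samples, "^\\./")  # drop the leading "./" from sample names
# Transcript-to-gene map with columns transcript_id and gene_id
tx2gene <- read.delim("/home/reference_Hsapiens/gene2transcript.coding.txt")
txi <- tximport(files, type = "salmon", tx2gene = tx2gene[, c("transcript_id", "gene_id")], countsFromAbundance = "lengthScaledTPM")
data <- txi$counts %>% round() %>% data.frame()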

My question is: why do I have to use the quant.sf table with a gene2transcript reference? Can't I just use my quant.genes.sf table and build an un-normalized counts table from the NumReads column? I feel that this would be the same as using tximport with quant.sf.

R deseq2 rna-seq salmon • 322 views

updated 8 weeks ago by i.sudbery 20k • written 8 weeks ago by Ezequiel ▴ 10

score 1 · Answer 1 · 2024-10-31

One of the improvements of the salmon -> tximport -> DESeq2 pipeline over a plain counts -> DESeq2 pipeline is the ability to account for the effect of changes in transcript usage on gene-level counts. Negative binomial (NB) based differential expression algorithms rely on the assumption that they are comparing the same piece of sequence between the two samples. This is why you can't compare the expression of gene A to gene B using NB-based methods: obviously a 1000bp gene will have more reads than a 500bp gene expressed at the same level. When we do DE, we assume we are comparing the same sequence in two different samples, so any biases caused by the length or composition of that sequence are the same in both samples and cancel out.

However, this is not true if there is a change in the representation of different isoforms between the two conditions. If condition A uses longer or shorter isoforms than condition B, the gene's effective length differs between the conditions, and the length bias no longer cancels out.
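
To make the isoform problem concrete, here is a minimal base-R sketch. The isoform lengths and molecule numbers are made up for illustration, and it assumes expected reads are simply proportional to molecules times length (ignoring effective-length corrections and sequencing depth).

```r
# Toy gene with two isoforms; lengths (bp) are made-up illustration values.
iso_len <- c(long = 2000, short = 500)

# Transcript molecules per condition: the same total (100) in both,
# but condition B shifts usage towards the short isoform.
molecules <- cbind(A = c(long = 80, short = 20),
                   B = c(long = 20, short = 80))

# Assumption: expected reads are proportional to molecules x length
# (arbitrary units). Each column of `molecules` is multiplied by iso_len.
expected_reads <- colSums(molecules * iso_len)

print(colSums(molecules))   # total molecules per condition: identical
print(expected_reads)       # gene-level read signal: not identical
print(log2(expected_reads[["B"]] / expected_reads[["A"]]))  # apparent log2 fold change
```

In this toy case the gene is expressed at the same level in both conditions, yet a raw gene-level count would suggest it is down in condition B.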

tximport addresses this by calculating an abundance-weighted mean transcript length for each gene in each sample and passing it to DESeq2, which then uses it as a normalization offset when calculating differential expression. You wouldn't get this if you just loaded up a matrix of gene counts.
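
The weighted mean length idea can be sketched in a few lines of base R. This is only an illustration of the arithmetic with hypothetical numbers, not tximport's actual code; the variable names and values are assumptions.

```r
# Hypothetical transcript-level values for one gene with two isoforms.
iso_len <- c(long = 2000, short = 500)        # isoform lengths (bp)
tpm <- cbind(S1 = c(long = 80, short = 20),   # isoform abundance per sample
             S2 = c(long = 20, short = 80))

# Abundance-weighted mean transcript length of the gene, per sample.
gene_len <- colSums(tpm * iso_len) / colSums(tpm)

# Assumption: gene-level counts are proportional to abundance x length.
gene_counts <- colSums(tpm * iso_len)

print(gene_len)                 # the gene is "longer" in S1 than in S2
print(gene_counts / gene_len)   # length-corrected signal is equal again
```

With a plain matrix of gene counts, the per-sample `gene_len` values are not available, so this correction cannot be applied.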