Question

DE analysis from kallisto estimated counts (only)

1

Entering edit mode

5 months ago

Lila M ★ 1.3k

Hello all, I have some 'raw' data from kallisto, and my aim is to perform a DE analysis in to conditions.

I have a count table from kallisto (estimated counts) like this:

head(x)
target_id  length sample1  sample2 sample3 
  <chr>       <dbl>                  <dbl>                  <dbl>                  <dbl>          
1 ENST00000…      8                      0                      0                      0                     
2 ENST00000…      9                      0                      0                      0                     
3 ENST00000…     13                      0                      0                      0                    
4 ENST00000…     23                      0                      0                      0

and a effective length table from kallisto like this:

I don't have any other files to be used to generate the txi table, so

head(y)
# A tibble: 6 × 427
  target_id  length sample1  sample2  sample3  
  <chr>       <dbl>                  <dbl>                  <dbl>                  <dbl>                  
1 ENST00000…      8                   9                       7.5                   9                     
2 ENST00000…      9                  10                       8.5                  10                
3 ENST00000…     13                   1                      12.5                   3.5                   
4 ENST00000…     23                   5.88                   13.5                   6.33      
5 ENST00000…     19                   4.4                    12.7                   7         
6 ENST00000…     31                  10.6                    12.2                  11

I know this data is not enough to perform the analysis properly, but I am considering if by using limma-voom and use the counts provided as a tx$count, and follow this approach:

First approach using limma-voom:

library(edgeR)
library(biomaRt)

#generate tx2gene: 

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")  # Use appropriate dataset
transcripts_list <- rownames(counts) 

tx2gene <- getBM(
      attributes = c("ensembl_transcript_id", "ensembl_gene_id", "hgnc_symbol"),  # Attributes you want to retrieve
      filters = "ensembl_transcript_id",  # Filter by transcript ID
      values = transcripts_list,  # Your list of transcript IDs
      mart = ensemble)

counts1 <- left_join(counts, tx2gene)

#aggregate the results as a gene-count table: 
gene_counts <- counts1 %>%
group_by(hgnc_symbol) %>% 
summarise(across(starts_with("count"), sum, na.rm = TRUE))

#perform Limma-vool
y <- DGEList(gene_counts)
design <- model.matrix(~ 0 + matrix$Group) 
colnames(design) <- levels(matrix$Group)
keep <- filterByExpr(y, design)
y <- y[keep, ]
y <- calcNormFactors(y)
v <- voom(y, design)
fit <- lmFit(v, design)
contrast.matrix <- makeContrasts(G1 - G2, levels = design)
fit2 <- contrasts.fit(fit, counts1)
fit2 <- eBayes(fit2)
res <- topTable(fit2, adjust="fdr", number = Inf)

Does anyone have any other advice/feedback about how to perform an analysis using this data? Do you think that approach would be correct for an exploratory analysis or should I just not use this data?

Thank you very much

deviation DE limma voom • 1.2k views

ADD COMMENT • link updated 4 months ago by GenoMax 150k • written 5 months ago by Lila M ★ 1.3k

0

Entering edit mode

This looks like edgeR not limma-voom. Anyway, what exactly is the problem here?

ADD REPLY • link 4 months ago by dsull ★ 7.4k

0

Entering edit mode

I just wanted to make sure if this approach is correct considering the inputs I am providing or if there is a different alternative I could use

Thanks!

ADD REPLY • link 4 months ago by Lila M ★ 1.3k

0

Entering edit mode

my recommendation is that you produce files with counts first then do the differential expression so that you don't have to keep redoing that step.

basically summarize counts by gene, save that into a file, and that would be script 1.

now in a new script, load these new counts and do the differential expression there,

that way you don't have redo that step each time you run the differential expression, now you can explore the effect of filtering rows based on minimal counts

you should also investigate these new resulting summarized counts to make sure they make sense

ADD REPLY • link 4 months ago by Istvan Albert 102k

0

Entering edit mode

That is basically what I did, but as I haven't generated the counts (and they are per transcripts) I just summarised by gene counts and then I made the DE. My only concern is that the counts are the raw counts from Kallisto and I don't have the other required files to load and get the gene counts by using tximpot

ADD REPLY • link 4 months ago by Lila M ★ 1.3k

0

Entering edit mode

There are no other "required files". The only files you need are the raw kallisto counts.

ADD REPLY • link 4 months ago by dsull ★ 7.4k

0

Entering edit mode

I understand that to create the gene counts from kallisto you also need the abundance.tsv file, and then generate the gene counts from them and tx2gene.

ADD REPLY • link 4 months ago by Lila M ★ 1.3k

1

Entering edit mode

The abundance.tsv is the raw kallisto output -- is there any reason that you can't get the abundance.tsv files? Running kallisto automatically gives you those files...

Otherwise, if not, what you're doing (i.e. manually summing up the counts) is probably a good enough approximation.

ADD REPLY • link 4 months ago by dsull ★ 7.4k

GenoMax · Answer 1 · 2024-11-28

Hi Lila, could you please specify the type of counts you are using? Kallisto typically generates Transcripts Per Million (TPM) values, which are not directly compatible with the limma package. If you are indeed using TPM values, your approach is on the right track. However, there is an issue with using counts directly from Kallisto, as limma cannot process these TPM values. To address this, you need to use a tool like tximport to correctly convert Kallisto TPM values into a format that limma can handle. Once this conversion is done, the rest of the analysis pipeline remains largely the same. You can use either use abundance.tsv or abumdance.h5 and it does not matter. input should be like a matrix

ID                                   Sample_1_TPM          Sample_2_TPM               Sample_3_TPM    ........
Transcript /gene_1          counts_value              counts_value                  counts_value       ......                           
Transcript /gene_2          counts_value              counts_value                  counts_value       ...... 
Transcript /gene_3          counts_value              counts_value                  counts_value       ...... 
Transcript /gene_4          counts_value              counts_value                  counts_value       ...... 
................

You can use this function and change parameters accordingly

txi <- tximport(
  files = sampleTable$file_path,
  type = "kallisto",
  txIn = TRUE,
  txOut = FALSE,
  countsFromAbundance = "scaledTPM",
  tx2gene = tx2gene[, c("tx_id", "gene_id")],
  ignoreTxVersion = TRUE
)
colnames(txi$counts) <- sampleTable$sample

Once this is done, you can pass this information to DGEList and continue your rest of analyzing

y <- DGEList(counts = txi$counts)