DE analysis from kallisto estimated counts (only)
1
0
Entering edit mode
6 days ago
Lila M ★ 1.3k

Hello all, I have some 'raw' data from kallisto, and my aim is to perform a DE analysis in to conditions.

I have a count table from kallisto (estimated counts) like this:

head(x)
target_id  length sample1  sample2 sample3 
  <chr>       <dbl>                  <dbl>                  <dbl>                  <dbl>          
1 ENST00000…      8                      0                      0                      0                     
2 ENST00000…      9                      0                      0                      0                     
3 ENST00000…     13                      0                      0                      0                    
4 ENST00000…     23                      0                      0                      0            

and a effective length table from kallisto like this:

I don't have any other files to be used to generate the txi table, so

head(y)
# A tibble: 6 × 427
  target_id  length sample1  sample2  sample3  
  <chr>       <dbl>                  <dbl>                  <dbl>                  <dbl>                  
1 ENST00000…      8                   9                       7.5                   9                     
2 ENST00000…      9                  10                       8.5                  10                
3 ENST00000…     13                   1                      12.5                   3.5                   
4 ENST00000…     23                   5.88                   13.5                   6.33      
5 ENST00000…     19                   4.4                    12.7                   7         
6 ENST00000…     31                  10.6                    12.2                  11      

I know this data is not enough to perform the analysis properly, but I am considering if by using limma-voom and use the counts provided as a tx$count, and follow this approach:

First approach using limma-voom:

library(edgeR)
library(biomaRt)

#generate tx2gene: 

ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl")  # Use appropriate dataset
transcripts_list <- rownames(counts) 

tx2gene <- getBM(
      attributes = c("ensembl_transcript_id", "ensembl_gene_id", "hgnc_symbol"),  # Attributes you want to retrieve
      filters = "ensembl_transcript_id",  # Filter by transcript ID
      values = transcripts_list,  # Your list of transcript IDs
      mart = ensemble)

counts1 <- left_join(counts, tx2gene)

#aggregate the results as a gene-count table: 
gene_counts <- counts1 %>%
group_by(hgnc_symbol) %>% 
summarise(across(starts_with("count"), sum, na.rm = TRUE))

#perform Limma-vool
y <- DGEList(gene_counts)
design <- model.matrix(~ 0 + matrix$Group) 
colnames(design) <- levels(matrix$Group)
keep <- filterByExpr(y, design)
y <- y[keep, ]
y <- calcNormFactors(y)
v <- voom(y, design)
fit <- lmFit(v, design)
contrast.matrix <- makeContrasts(G1 - G2, levels = design)
fit2 <- contrasts.fit(fit, counts1)
fit2 <- eBayes(fit2)
res <- topTable(fit2, adjust="fdr", number = Inf)

Does anyone have any other advice/feedback about how to perform an analysis using this data? Do you think that approach would be correct for an exploratory analysis or should I just not use this data?

Thank you very much

deviation DE limma voom • 604 views
ADD COMMENT
0
Entering edit mode

This looks like edgeR not limma-voom. Anyway, what exactly is the problem here?

ADD REPLY
0
Entering edit mode

I just wanted to make sure if this approach is correct considering the inputs I am providing or if there is a different alternative I could use

Thanks!

ADD REPLY
0
Entering edit mode

my recommendation is that you produce files with counts first then do the differential expression so that you don't have to keep redoing that step.

basically summarize counts by gene, save that into a file, and that would be script 1.

now in a new script, load these new counts and do the differential expression there,

that way you don't have redo that step each time you run the differential expression, now you can explore the effect of filtering rows based on minimal counts

you should also investigate these new resulting summarized counts to make sure they make sense

ADD REPLY
0
Entering edit mode

That is basically what I did, but as I haven't generated the counts (and they are per transcripts) I just summarised by gene counts and then I made the DE. My only concern is that the counts are the raw counts from Kallisto and I don't have the other required files to load and get the gene counts by using tximpot

ADD REPLY
0
Entering edit mode

There are no other "required files". The only files you need are the raw kallisto counts.

ADD REPLY
0
Entering edit mode

I understand that to create the gene counts from kallisto you also need the abundance.tsv file, and then generate the gene counts from them and tx2gene.

ADD REPLY
0
Entering edit mode

The abundance.tsv is the raw kallisto output -- is there any reason that you can't get the abundance.tsv files? Running kallisto automatically gives you those files...

Otherwise, if not, what you're doing (i.e. manually summing up the counts) is probably a good enough approximation.

ADD REPLY
1
Entering edit mode
1 day ago
mesebaesko ▴ 10

Hi Lila, could you please specify the type of counts you are using? Kallisto typically generates Transcripts Per Million (TPM) values, which are not directly compatible with the limma package. If you are indeed using TPM values, your approach is on the right track. However, there is an issue with using counts directly from Kallisto, as limma cannot process these TPM values. To address this, you need to use a tool like tximport to correctly convert Kallisto TPM values into a format that limma can handle. Once this conversion is done, the rest of the analysis pipeline remains largely the same. You can use either use abundance.tsv or abumdance.h5 and it does not matter. input should be like a matrix

ID                                   Sample_1_TPM          Sample_2_TPM               Sample_3_TPM    ........
Transcript /gene_1          counts_value              counts_value                  counts_value       ......                           
Transcript /gene_2          counts_value              counts_value                  counts_value       ...... 
Transcript /gene_3          counts_value              counts_value                  counts_value       ...... 
Transcript /gene_4          counts_value              counts_value                  counts_value       ...... 
................

You can use this function and change parameters accordingly

txi <- tximport(
  files = sampleTable$file_path,
  type = "kallisto",
  txIn = TRUE,
  txOut = FALSE,
  countsFromAbundance = "scaledTPM",
  tx2gene = tx2gene[, c("tx_id", "gene_id")],
  ignoreTxVersion = TRUE
)
colnames(txi$counts) <- sampleTable$sample

Once this is done, you can pass this information to DGEList and continue your rest of analyzing

y <- DGEList(counts = txi$counts)
ADD COMMENT

Login before adding your answer.

Traffic: 1303 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6