Hello all, I have some 'raw' data from kallisto, and my aim is to perform a DE analysis in to conditions.
I have a count table from kallisto (estimated counts) like this:
head(x)
target_id length sample1 sample2 sample3
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENST00000… 8 0 0 0
2 ENST00000… 9 0 0 0
3 ENST00000… 13 0 0 0
4 ENST00000… 23 0 0 0
and a effective length table from kallisto like this:
I don't have any other files to be used to generate the txi table, so
head(y)
# A tibble: 6 × 427
target_id length sample1 sample2 sample3
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENST00000… 8 9 7.5 9
2 ENST00000… 9 10 8.5 10
3 ENST00000… 13 1 12.5 3.5
4 ENST00000… 23 5.88 13.5 6.33
5 ENST00000… 19 4.4 12.7 7
6 ENST00000… 31 10.6 12.2 11
I know this data is not enough to perform the analysis properly, but I am considering if by using limma-voom and use the counts provided as a tx$count, and follow this approach:
First approach using limma-voom:
library(edgeR)
library(biomaRt)
#generate tx2gene:
ensembl <- useMart("ensembl", dataset = "hsapiens_gene_ensembl") # Use appropriate dataset
transcripts_list <- rownames(counts)
tx2gene <- getBM(
attributes = c("ensembl_transcript_id", "ensembl_gene_id", "hgnc_symbol"), # Attributes you want to retrieve
filters = "ensembl_transcript_id", # Filter by transcript ID
values = transcripts_list, # Your list of transcript IDs
mart = ensemble)
counts1 <- left_join(counts, tx2gene)
#aggregate the results as a gene-count table:
gene_counts <- counts1 %>%
group_by(hgnc_symbol) %>%
summarise(across(starts_with("count"), sum, na.rm = TRUE))
#perform Limma-vool
y <- DGEList(gene_counts)
design <- model.matrix(~ 0 + matrix$Group)
colnames(design) <- levels(matrix$Group)
keep <- filterByExpr(y, design)
y <- y[keep, ]
y <- calcNormFactors(y)
v <- voom(y, design)
fit <- lmFit(v, design)
contrast.matrix <- makeContrasts(G1 - G2, levels = design)
fit2 <- contrasts.fit(fit, counts1)
fit2 <- eBayes(fit2)
res <- topTable(fit2, adjust="fdr", number = Inf)
Does anyone have any other advice/feedback about how to perform an analysis using this data? Do you think that approach would be correct for an exploratory analysis or should I just not use this data?
Thank you very much
This looks like edgeR not limma-voom. Anyway, what exactly is the problem here?
I just wanted to make sure if this approach is correct considering the inputs I am providing or if there is a different alternative I could use
Thanks!
my recommendation is that you produce files with counts first then do the differential expression so that you don't have to keep redoing that step.
basically summarize counts by gene, save that into a file, and that would be script 1.
now in a new script, load these new counts and do the differential expression there,
that way you don't have redo that step each time you run the differential expression, now you can explore the effect of filtering rows based on minimal counts
you should also investigate these new resulting summarized counts to make sure they make sense
That is basically what I did, but as I haven't generated the counts (and they are per transcripts) I just summarised by gene counts and then I made the DE. My only concern is that the counts are the raw counts from Kallisto and I don't have the other required files to load and get the gene counts by using tximpot
There are no other "required files". The only files you need are the raw kallisto counts.
I understand that to create the gene counts from kallisto you also need the abundance.tsv file, and then generate the gene counts from them and tx2gene.
The abundance.tsv is the raw kallisto output -- is there any reason that you can't get the abundance.tsv files? Running kallisto automatically gives you those files...
Otherwise, if not, what you're doing (i.e. manually summing up the counts) is probably a good enough approximation.