Your approach seems reasonable and well-informed, though it's perhaps less customary in the field, which could be why you haven't found papers doing the exact same thing. Your plan to supplement the reference transcriptome with novel isoforms from IsoSeq data and then align the short-read RNA-Seq data to this enhanced transcriptome using Salmon is a sound strategy.
Here’s a step-by-step breakdown of your approach and how it could address your concerns:
Transcriptome Supplementation with IsoSeq Data:
Processing the IsoSeq data to identify novel isoforms and adding them to the reference transcriptome ensures that the alignment of short-read data can capture a more complete spectrum of isoforms present in your samples. This can potentially lead to more accurate expression quantification at the isoform level because it mitigates the issue of missing novel isoforms that might not be as well represented in the reference alone.
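As a rough illustration of the supplementation step, here is a minimal base-R sketch that picks novel isoforms out of a classification table and appends their sequences to a reference. Everything in it is a toy stand-in: the column names (isoform, structural_category), the category labels, the sequence IDs, and the output path are assumptions for illustration, so check them against the header of your own classification file.

```r
# Toy stand-in for a pigeon/SQANTI-style classification table.
# Column names and category labels are assumptions; check your file's header.
classification <- data.frame(
  isoform = c("PB.1.1", "PB.1.2", "PB.2.1", "PB.3.1"),
  structural_category = c("full-splice_match", "novel_in_catalog",
                          "novel_not_in_catalog", "incomplete-splice_match"),
  stringsAsFactors = FALSE
)

# Toy FASTA contents as character vectors (unwrapped: one header, one sequence line).
reference_fa <- c(">ENSMUST_toy_1", "ACGTACGT", ">ENSMUST_toy_2", "GGCCTTAA")
isoseq_fa <- c(">PB.1.1", "ACGTACGT", ">PB.1.2", "ACGTTTGT",
               ">PB.2.1", "TTGACCAA", ">PB.3.1", "ACGTAC")

# IDs of isoforms classified as novel.
novel_categories <- c("novel_in_catalog", "novel_not_in_catalog")
novel_ids <- classification$isoform[
  classification$structural_category %in% novel_categories
]

# Keep the header line and the following sequence line for each novel isoform.
is_header <- startsWith(isoseq_fa, ">")
header_idx <- which(is_header & sub("^>", "", isoseq_fa) %in% novel_ids)
novel_fa <- isoseq_fa[sort(c(header_idx, header_idx + 1))]

combined_fa <- c(reference_fa, novel_fa)

# Duplicate transcript names would make the quantification ambiguous, so check first.
combined_ids <- sub("^>", "", combined_fa[startsWith(combined_fa, ">")])
stopifnot(!anyDuplicated(combined_ids))

# writeLines(combined_fa, "combined_transcripts.fa")  # hypothetical output path
```

For real files with wrapped sequence lines, a FASTA-aware reader would be safer than the two-line assumption used here.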
Alignment with Short-Read Data:
Strictly speaking, Salmon does not produce a traditional alignment: it uses a lightweight mapping approach (quasi-mapping, or selective alignment in recent versions) that is fast and memory-efficient. Salmon is designed to quantify isoform abundances directly from RNA-Seq reads against a transcriptome, which fits your objective.
It's important to note that while combining both data types may improve the comprehensiveness of the transcriptome annotation, it could also introduce some challenges. IsoSeq data may contain novel transcripts that are low in abundance or tissue-specific, which may not be represented in the short-read data and could potentially complicate the differential expression analysis if these transcripts do not have sufficient coverage.
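One common way to limit the impact of poorly covered novel transcripts is to filter on counts before testing. The sketch below uses a toy count matrix in base R; the threshold of 10 reads and the minimum of 3 samples are assumptions to tune for your sequencing depth and group sizes, not fixed recommendations.

```r
# Toy transcript-level count matrix (rows = transcripts, columns = samples).
counts <- matrix(
  c(250, 310, 275, 290, 305, 260,   # well-covered reference transcript
      0,   1,   0,   2,   0,   1,   # novel isoform with almost no short-read support
     12,  15,   9,  40,  38,  45),  # novel isoform with moderate support
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ref_tx", "novel_low", "novel_ok"), paste0("sample", 1:6))
)

min_count <- 10    # assumed threshold; tune for your sequencing depth
min_samples <- 3   # assumed size of the smallest group

# Keep transcripts with at least min_count reads in at least min_samples samples.
keep <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]
print(rownames(filtered))
```

The same kind of logical vector can be used to subset a DESeq2 object (dds[keep, ]) before running DESeq().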
Regarding the second part of your question:
Differential Abundance at the Isoform Level:
Typically, methods like DESeq2 and edgeR require raw counts as input, not TPM values. However, Salmon provides quantifications in a format that is amenable to being imported and analyzed with the tximport package in R, which then allows you to use DESeq2 or edgeR for differential expression analysis at the isoform level.
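To see why TPM is not a substitute for counts, consider this small base-R sketch with made-up numbers. The same library is "sequenced" at two depths: the counts differ tenfold, but the TPM values come out identical, so the depth information that count-based models rely on is lost.

```r
# Toy effective lengths and counts; all values are invented for illustration.
eff_length <- c(tx1 = 1000, tx2 = 2000, tx3 = 500)
counts <- cbind(
  shallow = c(tx1 = 100,  tx2 = 200,  tx3 = 50),
  deep    = c(tx1 = 1000, tx2 = 2000, tx3 = 500)
)

# Reads per base of effective length (the length vector recycles down each column).
rate <- counts / eff_length

# TPM: rescale each sample so that its values sum to one million.
tpm <- t(t(rate) / colSums(rate)) * 1e6

print(colSums(tpm))   # each sample sums to 1e6 regardless of depth
print(tpm)            # identical columns despite a tenfold difference in counts
```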
Here’s how you might proceed:
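Run Salmon to get isoform-level quantifications (its quant.sf output reports both TPM and estimated read counts).
Import Salmon’s output into R with the tximport package, keeping it at the transcript level (txOut = TRUE). Supplying a tx2gene table with txOut = FALSE would instead summarize to gene level, which is not what you want for isoform-level testing.
Proceed with differential expression analysis using DESeq2 or edgeR, which will now receive count-scale input thanks to tximport's handling of the import.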
Example of commands that you could use for this process:
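library(tximport)
library(DESeq2)
# Assuming you have Salmon outputs in directories named as such (list every sample):
salmon_dirs <- c("sample1", "sample2", "sample3")  # ...
# tximport expects the path to each quant.sf file, not the directory itself
files <- file.path(salmon_dirs, "quant.sf")
names(files) <- salmon_dirs
# Importing the data at the transcript level (txOut = TRUE skips gene-level summarization)
txi <- tximport(files = files, type = "salmon", txOut = TRUE)
# Preparing a sample table for DESeq2 (rows must be in the same order as the columns of txi$counts)
n <- 3  # replicates per genotype; placeholder, set to your design (length(files) must equal 3 * n)
sampleTable <- data.frame(condition = factor(rep(c("genotype1", "genotype2", "genotype3"), each = n)))
# Running DESeq2 for differential expression analysis
dds <- DESeqDataSetFromTximport(txi, colData = sampleTable, design = ~ condition)
dds <- DESeq(dds)
# With three genotypes, results() defaults to the last level vs. the first;
# pass contrast = c("condition", "genotype2", "genotype1") for other pairs
res <- results(dds)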
Thanks biofalconch for the advice.
Re: stringtie - I hadn't heard of this, it appears to be a transcriptome assembler, is that correct? What is the purpose of using this if I have already run isoseq3 and pigeon?
I have used the sorted_classification.filtered_lite_classification.txt output from pigeon to select only the novel collapsed isoforms and add these to the reference transcriptome (gencode.vM22.transcripts.fa).
Thanks for any advice you can give. This is my first attempt at using Isoseq data.
What stringtie does is not only assemble, but also merge two different annotations. Through stringtie --merge you would end up with a GTF file in which all the transcript isoforms of a gene share the same gene ID, which will facilitate the downstream analyses.
You may also be interested in IsoformSwitchAnalyzeR and DEXSeq for looking at isoform switching and differential exon usage.
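On the shared gene ID point: once every isoform carries a consistent gene_id in a GTF, a transcript-to-gene table can be pulled out of the attribute column. Below is a minimal base-R sketch on three toy GTF lines; the IDs and the ref_gene_id attribute are invented for illustration, and a real file would be read with readLines() or a dedicated GTF parser.

```r
# Toy GTF lines (tab-separated, attributes in column 9); all IDs are made up.
gtf_lines <- c(
  'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "ENSMUST_toy_1"; ref_gene_id "ENSMUSG_toy_1";',
  'chr1\tStringTie\texon\t100\t400\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "ENSMUST_toy_1";',
  'chr1\tStringTie\ttranscript\t100\t900\t.\t+\t.\tgene_id "MSTRG.1"; transcript_id "PB.1.2";'
)

fields <- strsplit(gtf_lines, "\t", fixed = TRUE)
feature <- vapply(fields, `[`, character(1), 3)
attrs <- vapply(fields, `[`, character(1), 9)

# Extract the quoted value of one attribute key. The key must sit at the start
# of the attribute string or right after "; ", so gene_id does not match ref_gene_id.
get_attr <- function(x, key) {
  m <- regmatches(x, regexpr(paste0('(^|; )', key, ' "[^"]*"'), x))
  sub('.*"([^"]*)"$', "\\1", m)
}

# One row per transcript feature: transcript ID first, gene ID second.
is_tx <- feature == "transcript"
tx2gene <- data.frame(
  TXNAME = get_attr(attrs[is_tx], "transcript_id"),
  GENEID = get_attr(attrs[is_tx], "gene_id"),
  stringsAsFactors = FALSE
)
print(tx2gene)
```

A two-column table in this order (transcript ID, then gene ID) is the shape tximport expects for its tx2gene argument if you later want gene-level summaries alongside the isoform-level analysis.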