Question

Differential Expression using Isoseq-supplemented reference transcriptome

15 months ago

Calum ▴ 10

Hi all,

I have a dataset of Illumina short-read RNA-Seq data from three different mouse genotypes (n = 6 per group), and paired PacBio IsoSeq data from a subset of these (n = 2 per group).

I have processed the IsoSeq data following the workflow on the isoseq3 tutorial. My plan is to use the high-quality isoforms classified as novel (Novel Not In Catalog, and Novel in Catalog) to supplement the mouse reference transcriptome and then align the paired short read data against this reference transcriptome using Salmon. My questions are:

1. Is this an appropriate approach? I can't find any papers that have used this method. They all seem to either a) generate a reference transcriptome from the IsoSeq data only and use it as the reference for short-read mapping with Salmon or Kallisto (I worry this would cause me to miss isoforms that were not captured by the IsoSeq data but were sequenced in the Illumina data), or b) align to the reference genome after combining the reference GFF with a GFF generated from the novel isoforms (to my knowledge, this won't give isoform-level information, only gene-level information).
2. What is the best method to test for differential abundance at the isoform level? Salmon returns TPM values instead of the counts that limma or DESeq2 require.

I would be grateful for any information or discussion.
Thanks,
Calum

RNA-Seq Salmon Isoseq • 2.0k views

Updated 7 months ago by Gordon Smyth ★ 7.7k • written 15 months ago by Calum ▴ 10

score 1 · Answer 1 · 2023-08-29

15 months ago

biofalconch ★ 1.3k

Hey, I'll go straight to the point.

Is this an appropriate approach?

Yes, there is nothing wrong with wanting to improve an annotation. My suggestion would be to use a tool like StringTie to produce a transcriptome from your IsoSeq reads, and then use stringtie --merge so that you have a unified annotation :)

What is the best method to test for differential abundance at the isoform level?

I recommend having a look at tximport (its documentation walks step by step through going from Kallisto/Salmon outputs to DESeq2).

Cheers!

Thanks biofalconch for the advice.
Re: stringtie - I hadn't heard of this, it appears to be a transcriptome assembler, is that correct?
What is the purpose of using this if I already have:

the collapsed isoforms from isoseq3
their associated classification information from pigeon

I have used the sorted_classification.filtered_lite_classification.txt output from pigeon to select only the novel collapsed isoforms and add these to the reference transcriptome (gencode.vM22.transcripts.fa)

Thanks for any advice you can give. This is my first attempt at using Isoseq data.

Reply • 15 months ago by Calum ▴ 10

StringTie not only assembles transcripts, it can also merge two different annotations. With stringtie --merge you would end up with a GTF file in which all the transcript isoforms of a gene share the same gene ID, which will facilitate the downstream analyses.

Reply • 15 months ago by biofalconch ★ 1.3k

You may also be interested in IsoformSwitchAnalyzeR and DEXSeq for looking at isoform switching and differential exon usage.

Reply • 15 months ago by jared.andrews07 ★ 18k

score 0 · Answer 2 · 2024-04-28

Your approach seems reasonable and well-informed, though it's perhaps less customary in the field, which could be why you haven't found papers doing the exact same thing. Your plan to supplement the reference transcriptome with novel isoforms from IsoSeq data and then align the short-read RNA-Seq data to this enhanced transcriptome using Salmon is a sound strategy.

Here’s a step-by-step breakdown of your approach and how it could address your concerns:

Transcriptome Supplementation with IsoSeq Data:

Processing the IsoSeq data to identify novel isoforms and adding them to the reference transcriptome ensures that the alignment of short-read data can capture a more complete spectrum of isoforms present in your samples. This can potentially lead to more accurate expression quantification at the isoform level because it mitigates the issue of missing novel isoforms that might not be as well represented in the reference alone.
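
As a rough illustration of the selection step, here is a minimal sketch in base R that picks out the novel isoform IDs from a classification table. The toy table, the column names (isoform, structural_category) and the category labels are assumptions modelled on SQANTI-style output; check them against the header of your own pigeon file before relying on this.

```r
# Toy stand-in for a classification table (made-up IDs and categories).
# Column names and category labels are assumptions; verify against your file.
classification <- data.frame(
  isoform = c("PB.1.1", "PB.1.2", "PB.2.1", "PB.3.1"),
  structural_category = c("full-splice_match", "novel_in_catalog",
                          "novel_not_in_catalog", "incomplete-splice_match"),
  stringsAsFactors = FALSE
)

# Keep only the two novel categories mentioned in the question
novel_categories <- c("novel_in_catalog", "novel_not_in_catalog")
is_novel <- classification$structural_category %in% novel_categories
novel_ids <- classification$isoform[is_novel]

print(novel_ids)
```

The sequences for these IDs would then be appended to the reference transcript FASTA before building the Salmon index, taking care that the new IDs do not clash with existing transcript names.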

Alignment with Short-Read Data:

By using Salmon for quantification, you take advantage of its lightweight mapping approach (quasi-mapping in early versions, selective alignment in recent ones), which is fast and memory-efficient. Salmon is also designed to quantify isoform abundances directly from RNA-Seq reads, which fits your objective.

It's important to note that while combining both data types may improve the comprehensiveness of the transcriptome annotation, it could also introduce some challenges. IsoSeq data may contain novel transcripts that are low in abundance or tissue-specific, which may not be represented in the short-read data and could potentially complicate the differential expression analysis if these transcripts do not have sufficient coverage.
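
One common way to handle such low-coverage transcripts is to filter them out before testing. Below is a minimal sketch in base R on a made-up count matrix. The rule used here (at least 10 counts in at least 6 samples, 6 being the per-group sample size from the question) is an assumption to tune for your data, not something prescribed in this thread.

```r
# Made-up transcript-by-sample count matrix: 3 genotypes x 6 samples each
n <- 6
counts <- rbind(
  tx_expressed = rep(50, 3 * n),               # expressed everywhere
  tx_one_group = c(rep(40, n), rep(0, 2 * n)), # expressed in one genotype only
  tx_rare      = c(12, rep(0, 3 * n - 1))      # seen in a single sample
)
colnames(counts) <- paste0("sample", seq_len(3 * n))

# Assumed thresholds: adjust to your sequencing depth and design
min_count <- 10
min_samples <- n

# Keep transcripts reaching min_count in at least min_samples samples
keep <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

print(rownames(filtered))
```

Tying min_samples to the smallest group size means a transcript expressed in only one genotype is still retained, which matters when the novel isoforms are genotype-specific.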

Regarding the second part of your question:

Differential Abundance at the Isoform Level: Methods like DESeq2 and edgeR expect counts as input, not TPM values. Salmon's quant.sf output reports an estimated read count (NumReads) for each transcript alongside its TPM, and the tximport package in R imports these estimates so that you can use DESeq2 or edgeR for differential expression analysis at the isoform level.
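
To make the counts-versus-TPM point concrete, here is a small sketch that parses a quant.sf-style table in base R. The column layout matches what Salmon writes, but the two rows (transcript names and numbers) are invented for illustration.

```r
# Invented two-row example in the layout of Salmon's quant.sf
quant_text <- paste(
  "Name\tLength\tEffectiveLength\tTPM\tNumReads",
  "tx_reference\t1500\t1350.0\t12.5\t240.0",
  "PB.1.2\t2100\t1950.0\t3.2\t88.7",
  sep = "\n"
)

quant <- read.delim(text = quant_text, stringsAsFactors = FALSE)

# NumReads holds the estimated counts; they need not be integers
print(colnames(quant))
print(quant[, c("Name", "TPM", "NumReads")])
```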

Here’s how you might proceed:

1. Use Salmon to get isoform-level abundance estimates.
2. Import Salmon's output into R with the tximport package, setting txOut = TRUE so that the estimates stay at the transcript level rather than being summarized to genes.
3. Proceed with differential expression analysis using DESeq2 or edgeR, which will now have the appropriate input format thanks to tximport's handling of the data conversion.

Example of commands that you could use for this process:

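library(tximport)
library(DESeq2)

# Assuming one Salmon output directory per sample, named as below (placeholders),
# each containing Salmon's quant.sf file
n <- 6  # samples per genotype, as in the question
samples <- paste0("sample", seq_len(3 * n))
files <- file.path(samples, "quant.sf")
names(files) <- samples

# Importing the data at the transcript level (txOut = TRUE skips gene-level summarization)
txi <- tximport(files = files, type = "salmon", txOut = TRUE)

# Preparing a sample table for DESeq2 (ensure sample information matches the order in txi)
sampleTable <- data.frame(
  condition = factor(c(rep("genotype1", n), rep("genotype2", n), rep("genotype3", n))),
  row.names = samples
)

# Running DESeq2 for differential expression analysis
dds <- DESeqDataSetFromTximport(txi, colData = sampleTable, design = ~ condition)
dds <- DESeq(dds)
# With three genotypes, state which pair to compare
res <- results(dds, contrast = c("condition", "genotype2", "genotype1"))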

score 0 · Answer 3 · 2024-04-28

What is the best method to test for differential abundance at the isoform level? Salmon returns TPM values instead of the counts which Limma or DESeq2 requires.

The edgeR function catchSalmon reads the Salmon output and transforms it to a form suitable for DE analysis at the isoform level. See:

Baldoni PL#, Chen Y#, Hediyeh-zadeh S, Liao Y, Dong X, Ritchie ME, Shi W, Smyth GK (2024). Dividing out quantification uncertainty allows efficient assessment of differential transcript expression with edgeR. Nucleic Acids Research 52(3), e13.
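
For orientation, here is a minimal sketch of that edgeR route as I understand it from the paper: catchSalmon estimates a mapping-ambiguity overdispersion for each transcript, the counts are divided by it, and the scaled counts then go through the usual quasi-likelihood pipeline. The function name dte_with_edger and its arguments are hypothetical, and coef = 2 simply tests the second genotype against the baseline. It also assumes Salmon was run with bootstrap or Gibbs resampling enabled (--numBootstraps or --numGibbsSamples), which catchSalmon needs in order to estimate the overdispersion.

```r
# Sketch only: needs the edgeR package and real Salmon output directories.
# salmon_dirs: character vector of Salmon output directories (one per sample)
# group: factor of genotypes, in the same order as salmon_dirs
dte_with_edger <- function(salmon_dirs, group) {
  # Read transcript counts and estimate per-transcript overdispersion
  quant <- edgeR::catchSalmon(salmon_dirs)

  # Divide out the quantification uncertainty (row i is divided by Overdispersion[i])
  scaled_counts <- quant$counts / quant$annotation$Overdispersion

  y <- edgeR::DGEList(counts = scaled_counts, group = group,
                      genes = quant$annotation)

  # Drop low-count transcripts, then normalize library sizes
  keep <- edgeR::filterByExpr(y)
  y <- y[keep, , keep.lib.sizes = FALSE]
  y <- edgeR::calcNormFactors(y)

  # Quasi-likelihood pipeline with genotype as the only factor
  design <- stats::model.matrix(~ group)
  y <- edgeR::estimateDisp(y, design)
  fit <- edgeR::glmQLFit(y, design)

  # Assumed comparison: second genotype level versus the first
  edgeR::glmQLFTest(fit, coef = 2)
}
```

A call would look like dte_with_edger(salmon_dirs, group), followed by edgeR::topTags() on the result to list the top transcripts.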