Question

Best practices RNAseq - Sleuth vs. NAs in annotation

1

7.8 years ago

BioBing ▴ 150

Hi all,

Currently, I am working on my very first RNAseq study and have met a dilemma where inputs from more experienced bioinformaticians would be amazing.

For a differential gene expression study in a non-model organism, a de novo reference transcriptome was assembled from 300 M reads with Trinity. For 3 experimental conditions (1 negative control, 1 positive control and the treatment of interest), triplicate samples were sequenced to a depth of 25 M reads.

The reference transcriptome was annotated with Trinotate.

To determine differential gene expression, the Kallisto/Sleuth pipeline was used - and here comes my best-practice dilemma:

A number of the Trinity transcripts could not be annotated by Trinotate (NA) and are dropped in the Sleuth analysis when using the "so <- sleuth_prep(s2c, ~treat, target_mapping = annotation, aggregation_column = 'gene')" expression.

I played around with the annotation file and replaced the NAs in the gene column with the corresponding Trinity transcript IDs. With that change, some of the previously dropped transcripts came out as significantly differentially expressed.

What is the right thing to do?

Would you let Sleuth drop the non-annotated transcripts, even though some of them are significantly differentially expressed?
Or, would you include these transcripts in the "gene" column with their corresponding Trinity transcript IDs, even though they cannot be analyzed on the gene-level (the transcript isoforms cannot be collapsed in the analysis like the annotated ones)?

Thank you!

Cheers, Birgitte

RNA-Seq R rna-seq • 2.8k views

0

7.8 years ago

BioBing ▴ 150

Thank you both for very good and useful answers - it is tricky to figure out what the "best practice" is, but I think I will add in the Trinity names, because some of those transcripts are highly significantly differentially expressed.

score 2 · Accepted Answer · 2017-07-06

This is a tricky one, and often "the right thing to do" becomes "what I can logically defend". Frequently in research you'll reach a point where there isn't really one right way to do something, only the most sensible option for your current situation that you can defend when asked.

Specifically to your question, I think it's logical to change your annotation file so that it contains the respective Trinity transcript IDs wherever there is an NA. Just because Trinotate can't annotate something that Trinity thinks is a transcript doesn't mean it's not potentially interesting. Also keep in mind that you're trying to do an incredibly complex task with a technology that is sub-optimal for the purpose (assuming short-read sequencing), and the assembly will not be perfect. You're going to see noise, i.e. transcripts that Trinity thinks are real when they're not (false positives), and vice versa (false negatives).
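
If it helps, the NA replacement could be sketched roughly like this in Python, as a preprocessing step before the table is read into R. This is only a sketch under assumptions: the column names "target_id" and "gene", the NA spellings, and the toy rows are all made up for illustration, and the helper names are hypothetical. The optional fallback strips the isoform suffix ("_i1", "_i2", ...) from the Trinity ID, which relies on Trinity's usual naming scheme, so that unannotated isoforms of the same Trinity gene would still share one label in the gene column.

```python
import csv
import io
import re

NA_LABELS = {"", "NA", "."}  # assumed spellings of "no annotation"


def trinity_gene_id(transcript_id):
    """Strip the trailing isoform suffix (_i1, _i2, ...) from a Trinity ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def fill_missing_genes(rows, fallback=lambda tid: tid):
    """Copy rows, replacing a missing 'gene' value with fallback(target_id)."""
    filled = []
    for row in rows:
        row = dict(row)  # do not modify the caller's rows
        if row["gene"].strip() in NA_LABELS:
            row["gene"] = fallback(row["target_id"])
        filled.append(row)
    return filled


# Toy annotation table; IDs and the gene name are invented for illustration.
toy = io.StringIO(
    "target_id\tgene\n"
    "TRINITY_DN100_c0_g1_i1\tHSP70\n"
    "TRINITY_DN200_c0_g1_i1\tNA\n"
    "TRINITY_DN200_c0_g1_i2\tNA\n"
)
rows = list(csv.DictReader(toy, delimiter="\t"))

# Option 1: each unannotated transcript becomes its own "gene".
by_transcript = fill_missing_genes(rows)

# Option 2: unannotated isoforms are grouped by their Trinity gene ID.
by_trinity_gene = fill_missing_genes(rows, fallback=trinity_gene_id)

for row in by_trinity_gene:
    print(row["target_id"], row["gene"], sep="\t")
```

Whichever option you pick, state it in your methods so that readers know these "genes" are assembler-defined labels, not annotated genes.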

The most logical thing to do depends on your experimental question, but if it's "What's different in the assembled transcriptome between my conditions?", then focus on the transcripts that are differentially expressed. Anything with a useful Trinotate annotation is fine, but those carrying the substituted Trinity IDs might be worth a follow-up. Consider extracting one of those sequences and running a BLAST search to inspect the results in a bit more detail.
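
For the follow-up, pulling the sequences of interest out of the assembly FASTA could look something like the sketch below. It is plain Python with no external libraries; the function names, the toy records, and the IDs are hypothetical, and it assumes the transcript ID is the first whitespace-delimited word of each header line.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the ID, i.e. the first word after ">".
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def extract(lines, wanted_ids):
    """Return {record_id: sequence} for the requested IDs only."""
    wanted = set(wanted_ids)
    return {name: seq for name, seq in read_fasta(lines) if name in wanted}


# Toy assembly; in practice, iterate over the open Trinity FASTA file instead.
toy_fasta = """>TRINITY_DN200_c0_g1_i1 len=12
ACGTACGT
ACGT
>TRINITY_DN300_c0_g1_i1 len=8
GGGGCCCC
""".splitlines()

hits = extract(toy_fasta, ["TRINITY_DN200_c0_g1_i1"])
for name, seq in hits.items():
    print(f">{name}\n{seq}")
```

The printed records can then be pasted into a BLAST search.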

score 2 · Accepted Answer · 2017-07-06

In addition to what andrew.j.skelton73 said, read the Trinity and Trinotate wikis carefully; they have suggestions on how to filter RNAseq assemblies for further analyses. For differential expression, you should filter by expression level, removing lowly expressed transcripts regardless of annotation status, as they will have low power anyway, especially with only three replicates per treatment.
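
As a rough illustration of an expression filter that ignores annotation status, here is a small Python sketch. The rule (keep a transcript if at least a given fraction of samples reach a minimum estimated count) and the thresholds are assumptions chosen for illustration; as far as I know they resemble Sleuth's default filter, but check the documentation before relying on that. The counts are made up.

```python
def passes_filter(counts, min_reads=5, min_prop=0.47):
    """True if at least min_prop of the samples have >= min_reads counts."""
    if not counts:
        return False
    n_ok = sum(1 for c in counts if c >= min_reads)
    return n_ok / len(counts) >= min_prop


# Toy estimated counts for 9 samples (3 conditions x 3 replicates).
toy_counts = {
    "TRINITY_DN100_c0_g1_i1": [120, 98, 110, 300, 280, 310, 15, 20, 18],
    "TRINITY_DN200_c0_g1_i1": [0, 1, 0, 40, 35, 52, 44, 38, 47],
    "TRINITY_DN300_c0_g1_i1": [0, 2, 1, 0, 0, 3, 1, 0, 6],
}

kept = [tid for tid, counts in toy_counts.items() if passes_filter(counts)]
print(kept)
```

Note that a proportion-based rule like this keeps a transcript that is silent in one condition but clearly expressed in the others, which is often exactly the kind of transcript you care about.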