Question

Can you detect novel (non-coding) transcripts using Salmon and the hg38 transcriptome?

2

6.1 years ago

c_u ▴ 530

Hi,

This might be another naive question. I have RNA-Seq data from 2 different types of samples (pain vs no-pain), and I want to identify novel lncRNA/eRNA in pain. I also have DHS/ATAC data that gives me the accessible regions in the DNA, some of which would be regulatory.

One way to do this is to use Trinity (using all samples) for de-novo transcript assembly, then use Salmon on this assembled transcriptome to find transcripts that are differentially expressed, and then among those, the transcripts that don't have prior annotation would be 'novel' transcripts. Cufflinks can also be used for this.

Another way would be to just directly run Salmon on the samples, using the hg38 transcriptome, and find differentially expressed transcripts, some of which could be novel.

I want to know whether the first method will lead to more novel transcripts, and whether it has any merits over the second in general.

Thank you!

RNA-Seq • 4.5k views

ADD COMMENT • link 6.1 years ago by c_u ▴ 530

3

6.1 years ago

WouterDeCoster 48k

If you use the known transcriptome you are, by definition, not going to find anything novel.

ADD COMMENT • link 6.1 years ago by WouterDeCoster 48k

0

Hi Wouter, thank you for the answer. I meant novel transcripts that could be lncRNA/eRNA because they overlap with the available DNase hypersensitivity (DHS) data. In other words, the DHS data tells us which regions of the DNA are accessible, and if we then see that those regions also produce transcripts, these could be novel eRNA. So, in that context, can we expect any difference in finding novel regulatory transcripts between the 2 methods?
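To make the overlap idea concrete, here is a minimal sketch of what "transcript falls in an accessible region" could look like in Python. Everything in it is an assumption for illustration: the `Interval` type, the function names, the transcript IDs and the coordinates are made up, and coordinates are treated as half-open (BED-style). For real data a dedicated tool such as bedtools would be the more usual choice.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    """A named genomic interval with half-open coordinates [start, end)."""
    chrom: str
    start: int
    end: int
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def transcripts_in_open_chromatin(transcripts: List[Interval],
                                  peaks: List[Interval]) -> List[str]:
    """Return names of transcripts that overlap at least one accessibility peak."""
    return [t.name for t in transcripts if any(overlaps(t, p) for p in peaks)]


# Hypothetical un-annotated transcripts from an assembler (IDs are made up).
novel_transcripts = [
    Interval("chr1", 10_000, 12_000, "MSTRG.7.1"),
    Interval("chr1", 50_000, 51_000, "MSTRG.8.1"),
    Interval("chr2", 10_500, 11_500, "MSTRG.9.1"),
]

# Hypothetical DHS/ATAC peaks.
dhs_peaks = [
    Interval("chr1", 11_500, 11_900, "peak_1"),
    Interval("chr1", 60_000, 60_400, "peak_2"),
]

candidates = transcripts_in_open_chromatin(novel_transcripts, dhs_peaks)
print(candidates)
```

The nested loop is quadratic, which is fine for a toy example but would need an interval index or a sorted sweep for genome-scale peak sets.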

ADD REPLY • link 6.1 years ago by c_u ▴ 530

2

6.1 years ago

lieven.sterck 15k

Option 1! (Though the title of your post does not really cover this option.)

As WouterDeCoster already mentioned, if you use the known hg38 transcriptome (your option 2) then indeed you won't find any novel transcripts. You might find "known" genes that lack any functional annotation, which could be 'novel' in the context of your biological question, but you will definitely not find any un-annotated (as in: not structurally annotated up to now) transcripts.

With your option 1 you could find novel transcripts, as the assembled transcriptome can contain transcripts that have not been annotated in the genome (yet).

However, given the amount of work that has gone into annotating the human genome, I think there's not a big chance you will find many of those. Moreover, you will need to start discriminating between noise and true potential novel transcripts, which might not be as easy as it sounds.

Bottom line: yes, you kinda can, but there are a few things to be aware of when doing so.

ADD COMMENT • link 6.1 years ago by lieven.sterck 15k

0

Hi Lieven, thank you for your response. You mention option 1 as the way to go, but in another answer to this question, Kristoffer says that he doesn't see a point in using Trinity for human samples, as it's highly unlikely to find any new transcripts in human samples. You also mention something similar in your answer. In that case, would you say that using Salmon/StringTie to find differentially expressed transcripts is the better option, and then aim at intersecting it with DNA accessibility data in order to find novel enhancers/lncRNA?

ADD REPLY • link 6.1 years ago by c_u ▴ 530

1

well yes and no.

With your option 2 you will never find any new transcript as you only look at the known ones. Option 1 is your only valid option to find anything new (== un-annotated).

I could kinda agree that Trinity might not be the best option (though also not the worst one either and it will work as well) and that it makes sense to at least use the genome sequence to 'guide' your assembly. I do however not agree with the and part of his answer: if you want to find anything novel you must avoid being biased towards the known transcriptome as in that case you restrict yourself again to only the known transcripts. Cufflinks (would not recommend anymore, deprecated) and certainly StringTie are worthy alternatives indeed.

"using Salmon/StringTie to find differentially expressed transcripts is the better option"

that is for sure the most straightforward approach yes, "better" perhaps not ... it all depends on how novel you want your transcripts to be.

ADD REPLY • link 6.1 years ago by lieven.sterck 15k

1

I don't think Cufflinks/Cuffdiff is deprecated...? Are you perhaps thinking of TopHat? (where one should use HISAT2 instead).

In relation to transcript assembly we need to distinguish between just quantifying known transcripts, de-novo assembly and guided assembly. Both Cufflinks and StringTie have the option to do a guided assembly (via --GTF-guide for Cufflinks and -G for StringTie) so that the resulting transcriptome will contain known transcripts in addition to the novel ones identified in the data. This option works very well in my experience, and I personally believe that ignoring the high-quality transcriptomes we have for many organisms will result in less trustworthy results, since a very large fraction of the mRNAs you need to quantify will be the ones annotated in the databases.
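As a rough illustration of the guided assembly described above, this Python sketch only builds the StringTie command line (it does not run anything). The file names, the thread count and the helper name `stringtie_guided_cmd` are hypothetical; the flags used (`-G`, `-o`, `-p`) are StringTie's reference annotation, output and thread options. It assumes the reads were already aligned with a splice-aware aligner and the BAM is coordinate-sorted.

```python
from typing import List


def stringtie_guided_cmd(bam: str, reference_gtf: str, out_gtf: str,
                         threads: int = 4) -> List[str]:
    """Build (not run) a StringTie command for reference-guided assembly."""
    return [
        "stringtie", bam,
        "-G", reference_gtf,  # known annotation used as a guide
        "-o", out_gtf,        # assembled transcripts: known plus novel
        "-p", str(threads),   # number of threads
    ]


# Hypothetical file names, for illustration only.
cmd = stringtie_guided_cmd("pain_rep1.sorted.bam",
                           "annotation.gtf",
                           "pain_rep1.stringtie.gtf")
print(" ".join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` once StringTie is installed; building it as a list rather than a string avoids shell-quoting problems with unusual file names.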

ADD REPLY • link 6.1 years ago by Kristoffer Vitting-Seerup ★ 4.2k

1

That's debatable ;) , it uses assumptions/approaches which are nowadays considered outdated (e.g. FPKM values ...)

Of course I agree that one should not ignore the (well-annotated) human transcriptome, but OP specifically asked for an approach to identify novel transcripts. In the context of not having to spend time on already known transcripts, it might indeed be good to be able to 'filter out' the known ones.

ADD REPLY • link 6.1 years ago by lieven.sterck 15k

score 3 · Accepted Answer · 2019-07-03

3

6.1 years ago

Kristoffer Vitting-Seerup ★ 4.2k

I would not use Trinity on human samples. Trinity is mainly usable when you do not have a reference genome.

Since you have a reference genome and a well-annotated transcriptome, you are much better off using tools which perform guided de-novo transcript reconstruction. The best-known tools for that are Cufflinks and StringTie. I have written more about the detailed steps and considerations here.

ADD COMMENT • link 6.1 years ago by Kristoffer Vitting-Seerup ★ 4.2k

0

Hi Kristoffer, thank you for your response. The linked Bioconductor page is also very helpful! I had a question about what you said. Here you mention that for de-novo transcript reconstruction, Cufflinks/StringTie are good options, whereas in the linked page you mention Salmon/Kallisto as good options for this work. So, I am slightly confused. If my goal is to find novel enhancer transcripts/lncRNA, then which of the two options would make more sense?

ADD REPLY • link 6.1 years ago by c_u ▴ 530

2

Salmon/Kallisto are very good at quantifying known transcripts - but you are looking for novel features so you cannot use those. Instead you would need to use Cufflinks/StringTie - they can both find novel features. Make sure to read the manual pages carefully as they may filter very lowly expressed features out using their default cutoffs and you are looking for (very) lowly expressed features.
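To illustrate the point about default cutoffs, here is a small sketch that builds a StringTie command with the expression-related thresholds set explicitly instead of left at their defaults. The helper name, the file names and the numeric values are placeholders, not recommendations; `-c` (minimum read coverage) and `-f` (minimum isoform fraction) are StringTie options, but their defaults differ between versions, so check `stringtie -h` for the version you have.

```python
from typing import List, Optional


def stringtie_low_expression_cmd(bam: str, reference_gtf: str, out_gtf: str,
                                 min_coverage: Optional[float] = None,
                                 min_isoform_fraction: Optional[float] = None
                                 ) -> List[str]:
    """Build (not run) a guided StringTie command with optional explicit cutoffs."""
    cmd = ["stringtie", bam, "-G", reference_gtf, "-o", out_gtf]
    if min_coverage is not None:
        # Minimum read coverage allowed for predicted transcripts.
        cmd += ["-c", str(min_coverage)]
    if min_isoform_fraction is not None:
        # Minimum isoform abundance as a fraction of the locus's top isoform.
        cmd += ["-f", str(min_isoform_fraction)]
    return cmd


# Placeholder values only; lower cutoffs keep more weakly expressed
# features but also let more noise through.
relaxed = stringtie_low_expression_cmd("pain_rep1.sorted.bam",
                                       "annotation.gtf",
                                       "pain_rep1.relaxed.gtf",
                                       min_coverage=0.5,
                                       min_isoform_fraction=0.005)
print(" ".join(relaxed))
```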

ADD REPLY • link 6.1 years ago by Kristoffer Vitting-Seerup ★ 4.2k

0

Hi Kristoffer.vittingseerup, I used StringTie and it came out with a list of genes with identifiers of the form MSTRG. I gave up on using StringTie for now because I couldn't find any way to convert these identifiers to Ensembl IDs or HGNC symbols. Do you have any advice for that? Maybe I'm a bit OT in this discussion...

Thank you also for your workflow, I will take a look

ADD REPLY • link 6.1 years ago by Morris_Chair ▴ 370

2

When you run StringTie with the --merge option you just have to add in the annotation GTF via -G (the same one you used for the guided predictions) and it will add the gene names as extra columns (naturally only for the known features; the novel ones will not have any).
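A minimal sketch of how the merged GTF could then be used to translate MSTRG IDs, assuming the merged file carries `ref_gene_id` and `gene_name` attributes on transcripts that match the reference (check your own output, as attribute names may differ between versions). The GTF lines, IDs and gene names below are invented for the example, and `map_mstrg_ids` is a hypothetical helper name.

```python
import re
from collections import defaultdict
from typing import Dict, Set, Tuple

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

# Made-up merged GTF content: MSTRG.1 matches a known gene, MSTRG.2 does not.
SAMPLE_GTF = "\n".join([
    "\t".join(["chr1", "StringTie", "transcript", "1000", "5000", "1000", "+", ".",
               'gene_id "MSTRG.1"; transcript_id "TX_KNOWN_1"; '
               'gene_name "GENE_A"; ref_gene_id "REF_GENE_A";']),
    "\t".join(["chr1", "StringTie", "exon", "1000", "2000", "1000", "+", ".",
               'gene_id "MSTRG.1"; transcript_id "TX_KNOWN_1"; exon_number "1";']),
    "\t".join(["chr1", "StringTie", "transcript", "9000", "9800", "1000", "-", ".",
               'gene_id "MSTRG.2"; transcript_id "MSTRG.2.1";']),
])


def map_mstrg_ids(gtf_text: str) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each gene_id to the set of (ref_gene_id, gene_name) pairs seen for it."""
    mapping: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        refs = mapping[attrs["gene_id"]]  # creates an empty entry for novel loci
        if "ref_gene_id" in attrs:
            refs.add((attrs["ref_gene_id"], attrs.get("gene_name", "")))
    return dict(mapping)


id_map = map_mstrg_ids(SAMPLE_GTF)
novel_loci = sorted(g for g, refs in id_map.items() if not refs)
print(id_map)
print(novel_loci)
```

Loci that end up with an empty set are the ones without any reference match, i.e. the candidates for novel features. A single MSTRG locus can also collect several reference genes, which is why the mapping holds sets.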