Question

RNA-seq data for de-novo transcript assembly

6.7 years ago

liorglic ★ 1.4k

Hello,
I'm rather new to RNA-seq analysis and more familiar with DNA sequencing.
I'd like to perform de-novo assembly of transcripts from publicly-available (i.e. published) RNA-seq data in tomato and its wild relative. The reasons I need to do that are:
a. I want to discover novel genes not present in the reference genome.
b. Wild relatives have no reliable reference.
The final purpose is actually genome annotation, and assembled transcripts will be used as input for annotation pipelines.
Now, there's quite a lot of raw data out there resulting from RNA-seq experiments. My problem is that I don't know which data sets are suitable for de-novo assembly. I know that many studies are designed to quantify expression levels of pre-defined genes, but I am currently not interested in that and would just like to get a sense of what data can be used for my purpose in terms of:
- Sequencing coverage
- Read length (in short-read and long-read technologies data)
- Strand-specific sequencing
- Other factors I'm not aware of?
I guess there are no definitive answers here, but there should be some standard. For example, in DNA genome assembly, you can't do much with, say 5x coverage. But I understand that for transcript-assembly, too deep is also a problem (although quite easy to solve). That's the kind of advice I'm looking for.
Thank you very much!

rna RNA-Seq rna-seq Assembly assembly • 2.1k views

ADD COMMENT • link updated 6.7 years ago by lieven.sterck 15k • written 6.7 years ago by liorglic ★ 1.4k

The reference assembly for the tomato genome should be quite OK. Do you have any reason to expect you might find a substantial number of novel genes not present in the genome?

ADD REPLY • link 6.7 years ago by lieven.sterck 15k

Yes, if I look at varieties/cultivars other than the one used to produce the reference (Heinz). This has not yet been done in tomato, but in other organisms (e.g. rice and maize) non-reference cultivars showed a substantial number of novel genes not found in the reference.

ADD REPLY • link 6.7 years ago by liorglic ★ 1.4k

Looking for resistance genes, are we? :P

ADD REPLY • link 6.7 years ago by cschu181 ★ 2.8k

No, not particularly...

ADD REPLY • link 6.7 years ago by liorglic ★ 1.4k

Was worth a shot >:D

ADD REPLY • link 6.7 years ago by cschu181 ★ 2.8k

score 1 · Answer 1 · 2018-03-12

Sequence coverage

Difficult to assess. I feel there is not really an upper or lower limit, and usually it's a matter of cost. On the other hand, it's often hard to reliably estimate the expected or desired coverage, as estimating the 'transcribed' part of the genome is not straightforward. Here you are in a bit of a blessed case, as you can use the tomato reference as a proxy. Too much coverage should not pose too many problems, as most transcript assembly tools should be able to deal with it, as well as with very uneven coverage on a per-transcript basis. (With DNA-seq you expect somewhat even coverage across the genome; with a transcriptome you will see much more variation, for biological reasons.)
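
To make the "use the reference as a proxy" idea concrete, here is a minimal back-of-the-envelope sketch. The function name and every number in it are assumptions for illustration only (the gene count and mean transcript length are placeholders, not values taken from any annotation); plug in figures from the reference annotation and from the data set you are evaluating.

```python
def estimated_coverage(n_fragments, read_length, transcriptome_bp, paired=True):
    """Average fold coverage = sequenced bases / transcribed bases.

    n_fragments is the number of read pairs (paired=True) or single reads.
    This is only an average: real per-transcript coverage varies enormously
    with expression level.
    """
    sequenced_bases = n_fragments * read_length * (2 if paired else 1)
    return sequenced_bases / transcriptome_bp


# Placeholder values (assumptions, not measured): number of annotated genes
# in the reference times an assumed mean transcript length in bp.
assumed_genes = 35_000
assumed_mean_transcript_bp = 1_500
transcriptome_bp = assumed_genes * assumed_mean_transcript_bp

cov = estimated_coverage(20_000_000, 150, transcriptome_bp, paired=True)
print(f"~{cov:.0f}x average coverage")
```

Because expression is so uneven, treat the result as a rough way to compare data sets rather than as a guarantee that lowly expressed transcripts are covered.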

Read length

Here it's the longer the better (surprise surprise ;) ). If you have a choice, I would certainly go for paired-end reads, and rather 150bp (or even 250bp) than 75 or so. If you have the possibility to go for long-read technologies (ONT, PacBio), those are certainly preferred over any short-read data, even if they come with lower coverage.
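
When screening public data sets, the read length stated in the metadata is not always what ends up in the files (e.g. after trimming), so it can help to check a downloaded FASTQ directly. A minimal sketch, assuming standard 4-line FASTQ records (no wrapped sequences); the function names and the file name in the usage comment are hypothetical:

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_length_counts(lines):
    """Count read lengths from an iterable of FASTQ lines.

    Assumes 4 lines per record, with the sequence on the second line.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 1:  # sequence line of each record
            counts[len(line.strip())] += 1
    return counts


# Usage (hypothetical file name):
# with open_fastq("reads_1.fastq.gz") as fh:
#     for length, n in sorted(read_length_counts(fh).items()):
#         print(length, n)
```

Running this on the first file of a pair is usually enough to see whether you are dealing with 50, 75, 100 or 150bp reads and whether they were trimmed.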

Strand specific

Nice to have, but not really crucial, I would say. If you do have that kind of data, make sure you use an assembly approach/tool that takes this kind of information into account.
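
As one example of passing strand information to an assembler (Trinity is my choice for illustration here, not a tool named above), the sketch below only builds a command line and does not run anything. The file names, memory and CPU values are placeholders, and `RF` is an assumption that fits typical dUTP-based paired-end libraries; check the library prep of the data set before setting it.

```python
def trinity_command(left, right, ss_lib_type=None,
                    max_memory="50G", cpu=8, output="trinity_out"):
    """Build a Trinity command line for paired-end FASTQ input.

    ss_lib_type: None for unstranded data, or "RF" / "FR" for stranded
    paired-end libraries (passed to Trinity's --SS_lib_type option).
    """
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", output,
    ]
    if ss_lib_type is not None:
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


# Placeholder file names; print the command rather than executing it.
print(" ".join(trinity_command("reads_1.fq.gz", "reads_2.fq.gz", ss_lib_type="RF")))
```

Leaving `ss_lib_type` as `None` gives the unstranded command, so the same helper covers both kinds of public data set.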