RNA-seq data for de-novo transcript assembly
6.7 years ago
liorglic ★ 1.4k

Hello,
I'm rather new to RNA-seq analysis and more familiar with DNA sequencing.
I'd like to perform de-novo assembly of transcripts from publicly available (i.e., published) RNA-seq data in tomato and its wild relatives. The reasons I need to do that are:
a. I want to discover novel genes not present in the reference genome.
b. Wild relatives have no reliable reference.
The final purpose is actually genome annotation, and assembled transcripts will be used as input for annotation pipelines.
Now, there's quite a lot of raw data out there resulting from RNA-seq experiments. My problem is that I don't know which data sets are suitable for de-novo assembly. I know that many studies are designed to quantify expression levels of pre-defined genes, but I am currently not interested in that and would just like to get a sense of what data can be used for my purpose in terms of:
- Sequencing coverage
- Read length (in short-read and long-read technology data)
- Strand-specific sequencing
- Other factors I'm not aware of?
I guess there are no definitive answers here, but there should be some standard. For example, in DNA genome assembly, you can't do much with, say, 5x coverage. But I understand that for transcript assembly, too much depth is also a problem (although quite easy to solve). That's the kind of advice I'm looking for.
Thank you very much!

rna RNA-Seq rna-seq Assembly assembly • 2.1k views

The reference assembly for the tomato genome should be quite OK. Do you have any reason to expect that you might find a substantial number of novel genes not present in the genome?


Yes, if I look at varieties/cultivars other than the one used to produce the reference (Heinz). This has not yet been done in tomato, but in other organisms (e.g., rice and maize) non-reference cultivars showed a substantial number of novel genes not found in the reference.


Looking for resistance genes, are we? :P


No, not particularly...


Was worth a shot >:D

6.7 years ago
  • Sequence coverage:

Difficult to assess. I feel there is not really an upper or lower limit, and usually it's a matter of cost. On the other hand, it's often hard to reliably estimate the expected or desired coverage, since estimating the 'transcribed' part of the genome is not straightforward. Here you are in a bit of a blessed case, as you can use the tomato reference as a proxy. Too much coverage should not pose too many problems: most transcript assembly tools should be able to deal with it, as well as with very uneven coverage on a per-transcript basis. (With DNA-seq you expect somewhat even coverage across the genome; with a transcriptome you will see much more variation, for biological reasons.)

  • Read length

Here it's the longer the better (surprise surprise ;) ). If you have a choice, I would certainly go for paired-end reads, and rather 150bp (or even 250bp) than 75 or so. If you have the possibility to go for long-read technologies (ONT, PacBio), those are certainly preferred over any short-read data, even if they come with lower coverage.

  • strand specific

Nice to have but not really crucial, I would say. If you do have that kind of data, make sure you use an assembly approach/tool that takes this kind of information into account.
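
To illustrate the strand-specificity point, here is a minimal sketch of how the strand information might be passed on to an assembler. Trinity is used purely as an example (no specific tool is named in this thread), and the helper name `build_trinity_command`, the file names, and the resource settings are all hypothetical. The command is only built as an argument list, not executed, and the right `--SS_lib_type` value (`RF` or `FR` for paired-end data) depends on the library prep, so check the metadata of each public data set.

```python
from typing import List, Optional


def build_trinity_command(
    left: str,
    right: str,
    ss_lib_type: Optional[str] = None,
    cpu: int = 8,
    max_memory: str = "50G",
    output_dir: str = "trinity_out",
) -> List[str]:
    """Build a paired-end Trinity command as an argument list (not executed)."""
    cmd = [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--CPU", str(cpu),
        "--max_memory", max_memory,
        "--output", output_dir,
    ]
    # Only add the strand flag when the library is known to be stranded.
    if ss_lib_type is not None:
        if ss_lib_type not in {"RF", "FR"}:
            raise ValueError("paired-end strand type must be 'RF' or 'FR'")
        cmd += ["--SS_lib_type", ss_lib_type]
    return cmd


# Hypothetical file names; 'RF' is an assumption about the library prep.
stranded = build_trinity_command("reads_1.fastq", "reads_2.fastq", ss_lib_type="RF")
unstranded = build_trinity_command("reads_1.fastq", "reads_2.fastq")
print(" ".join(stranded))
```

Keeping the flag optional means the same helper can be reused across public data sets where some libraries are stranded and others are not.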


Thank you. This is very helpful.
As I said, I'm not planning on producing new RNA-seq data right now, but rather on using data available from various DBs. So, for example, I found a data set comprising ~2.8Gb of sequencing data, with reads of length 61. Would you consider assembling this, or would you say it's not enough?


If it's the only one you have for a certain cultivar or experiment, then yes, I would consider it. People have been doing this (successfully) when 61bp was the only read length available. If, on the other hand, you also have longer-read data for the same setup, I would prefer those (or consider merging them).
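
For a sense of scale, the numbers mentioned above can be turned into a rough read count and average depth. This is only a back-of-the-envelope sketch: it assumes "~2.8Gb" means 2.8 gigabases, and the 60 Mb size for the transcribed part of the genome is a made-up placeholder (in practice it could be approximated from the annotated transcripts of the tomato reference). The function and constant names are hypothetical.

```python
def estimate_reads_and_depth(total_bases, read_length, transcribed_bases):
    """Return (number of reads, mean depth over the transcribed sequence)."""
    n_reads = total_bases / read_length
    mean_depth = total_bases / transcribed_bases
    return n_reads, mean_depth


TOTAL_BASES = 2.8e9               # assumption: "~2.8Gb" = 2.8 gigabases
READ_LENGTH = 61                  # read length quoted in the thread
ASSUMED_TRANSCRIBED_BASES = 60e6  # hypothetical placeholder, not from the thread

n_reads, mean_depth = estimate_reads_and_depth(
    TOTAL_BASES, READ_LENGTH, ASSUMED_TRANSCRIBED_BASES
)
print(f"{n_reads / 1e6:.1f} million reads, ~{mean_depth:.0f}x mean depth")
```

Keep in mind that a mean depth says little for a transcriptome: highly expressed genes will soak up most of the reads, while lowly expressed ones may end up with too few to assemble.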
