Hello,
I'm rather new to RNA-seq analysis and more familiar with DNA sequencing.
I'd like to perform de novo assembly of transcripts from publicly available (i.e. published) RNA-seq data in tomato and its wild relative. The reasons I need to do that are:
a. I want to discover novel genes not present in the reference genome.
b. Wild relatives have no reliable reference.
The final purpose is actually genome annotation, and assembled transcripts will be used as input for annotation pipelines.
Now, there's quite a lot of raw data out there resulting from RNA-seq experiments. My problem is that I don't know which data sets are suitable for de-novo assembly. I know that many studies are designed to quantify expression levels of pre-defined genes, but I am currently not interested in that and would just like to get a sense of what data can be used for my purpose in terms of:
- Sequencing coverage
- Read length (for both short-read and long-read data)
- Strand-specific sequencing
- Other factors I'm not aware of?
I guess there are no definitive answers here, but there should be some standard. For example, in DNA genome assembly, you can't do much with, say, 5x coverage. But I understand that for transcript assembly, coverage that is too deep is also a problem (although quite easy to solve). That's the kind of advice I'm looking for.
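To make the "too deep" case concrete, here is a minimal sketch of what I mean by "easy to solve": randomly subsampling reads down to a fixed number before assembly. This is plain reservoir sampling, not the k-mer-based normalization that some assemblers offer, and the function names (`read_fastq`, `reservoir_sample`) and the target size are hypothetical choices of mine, not taken from any pipeline.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# One FASTQ record: (header, sequence, separator, qualities)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records; a trailing incomplete record is dropped."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over four references to the same iterator groups lines in fours
    for record in zip(*[stripped] * 4):
        yield record


def reservoir_sample(records: Iterable[FastqRecord], k: int,
                     seed: int = 0) -> List[FastqRecord]:
    """Keep a uniform random sample of at most k records in one pass."""
    rng = random.Random(seed)
    sample: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records
            sample.append(rec)
        else:
            # Replace a kept record with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                sample[j] = rec
    return sample


if __name__ == "__main__":
    # Tiny in-memory example; a real run would pass an open FASTQ file handle
    demo = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
            "@r2\n", "GGCC\n", "+\n", "IIII\n",
            "@r3\n", "TTAA\n", "+\n", "IIII\n"]
    for header, seq, sep, qual in reservoir_sample(read_fastq(demo), k=2):
        print(header, seq)
```

For paired-end data, calling this with the same seed on both mate files (assuming they hold the same number of records in the same order) should select the same record positions, so pairs stay in sync.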
Thank you very much!
The reference assembly for the tomato genome should be quite OK. Do you have any reason to expect you might find a substantial number of novel genes not present in the genome?
Yes, if I look at varieties/cultivars other than the one used to produce the reference (Heinz). This has not yet been done in tomato, but in other organisms (e.g. rice and maize) non-reference cultivars showed a substantial number of novel genes not found in the reference.
Looking for resistance genes, are we? :P
No, not particularly...
Was worth a shot >:D