Hello Biostar members,
I am new to the field of NGS data analysis and would like some advice on transcriptome assembly for non-model plant species. I have both 454 and Illumina data for a Brassica plant, and I am using both reference-guided and de novo assembly approaches. My questions are as follows (please excuse me if they have already been asked on the forum):
- Is word size optimisation helpful for getting the best CLC de novo assemblies? Does the auto option (word size 20) suffice? How should I choose the best word size for a plant species?
- How can I evaluate the quality of the de novo assembly obtained? So far I have mostly looked at the N50 value and at the number of contigs with ORFs. Is that the right approach? What other factors should I look into?
- For reference genome-guided transcriptome assembly, which program is the best? How can I differentiate between genes from the different genomes in a polyploid when one of the diploid parent species is used as the reference?
I really appreciate your advice and help.
Thanks
Hey,
On your first question I can't give you advice, since I haven't used CLC. How to evaluate an assembly is a hard question, and the answer depends on what you want to do with the assembly. If you are mainly interested in recovering as many probable ORFs as possible, the N50 shouldn't bother you too much, imho. However, a higher N50 tends to come with more ORFs that can be predicted, so it is still worth keeping an eye on even if you are not aiming for the perfect assembly. In general, the criteria you already listed (N50 and the number of contigs with ORFs) are enough to judge whether an assembly is acceptable. For further information on "How to judge the quality of an assembly", have a look here or search for things like the Assemblathon.
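If you want to check the N50 yourself rather than rely on what the assembler reports, here is a minimal sketch in plain Python. It assumes your contigs are in FASTA format and uses the usual definition of N50 (the contig length at which the contigs of that length or longer cover at least half of all assembled bases). The function names `contig_lengths` and `n50` and the toy records are made up for illustration; they do not come from CLC or any other tool.

```python
def contig_lengths(fasta_lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            # Sequence lines may be wrapped, so keep adding them up.
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no contigs at all


# Toy input standing in for a real contig file (hypothetical data).
toy = [">c1", "ACGTACGTAC", ">c2", "ACGT", "AC", ">c3", "ACG"]
lengths = contig_lengths(toy)
print("contigs:", len(lengths))
print("total bases:", sum(lengths))
print("N50:", n50(lengths))
```

For a real assembly you would pass an open file handle instead of the toy list, e.g. `contig_lengths(open("contigs.fa"))`, where `contigs.fa` is a placeholder file name. Keep in mind that N50 only describes contiguity, so it is best read alongside the ORF counts you mentioned.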
Your third question is also a hard one, because all the tools have pros and cons. Since you are new to the field, I would suggest simply trying some of them, starting with the ones that sound most promising to you.