Does a high RPKM value always mean that the transcript is significant? How reliable is it? If it is reliable, what would be an optimal RPKM value to pinpoint whether a transcript is significant or not?
Are there any other parameters I could use to reduce the number of contigs from the de novo assembly and concentrate only on significant transcripts?
What my lab does is add ERCC spike-ins to the samples. They are polyadenylated sequences of known concentration. So you can look at them and if, say, spike-ins with an RPKM of 2-10 are still behaving linearly (that is, their measured RPKM still scales with their known concentration), then it's probably safe to say that real transcripts with RPKMs that low are behaving linearly too.
In my lab, with the experiments we run and the purposes of those experiments, we've been setting a loose cut-off at 0.5 RPKM, or 1 to be more stringent. But I wouldn't count on that value being necessarily applicable to your lab or your experiments.
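A rough sketch of how that spike-in linearity check could be done, in case it helps. This is not our actual pipeline: the function names and the numbers are made up for illustration, and it assumes you already have each spike-in's known concentration and its observed RPKM. It fits a straight line on the log2-log2 scale; a slope near 1 with a high R² in a given RPKM range suggests measurements in that range still track concentration.

```python
import math

def loglog_fit(concentrations, rpkms):
    """Least-squares fit of log2(RPKM) against log2(concentration).

    Returns (slope, intercept, r_squared). Pairs with a zero or negative
    value are skipped because they have no logarithm.
    """
    pts = [(math.log2(c), math.log2(r))
           for c, r in zip(concentrations, rpkms) if c > 0 and r > 0]
    if len(pts) < 3:
        raise ValueError("need at least 3 positive points to fit")
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    syy = sum((y - my) ** 2 for _, y in pts)
    if sxx == 0:
        raise ValueError("all concentrations are identical")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy ** 2) / (sxx * syy) if syy > 0 else float("nan")
    return slope, intercept, r_squared

def fit_in_rpkm_range(concentrations, rpkms, lo, hi):
    """Fit only the spike-ins whose observed RPKM lies in [lo, hi]."""
    kept = [(c, r) for c, r in zip(concentrations, rpkms) if lo <= r <= hi]
    return loglog_fit([c for c, _ in kept], [r for _, r in kept])

# Made-up illustrative values, NOT real ERCC concentrations or measurements.
known_conc = [1, 2, 4, 8, 16, 32]
observed_rpkm = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]

slope, intercept, r2 = loglog_fit(known_conc, observed_rpkm)
print(f"all spike-ins: slope={slope:.2f}, R^2={r2:.3f}")

low_slope, _, low_r2 = fit_in_rpkm_range(known_conc, observed_rpkm, 2, 10)
print(f"RPKM 2-10 only: slope={low_slope:.2f}, R^2={low_r2:.3f}")
```

If the fit degrades once you restrict it to the low-RPKM spike-ins, that is roughly where I would stop trusting the values, but where exactly to draw the line is a judgment call for your own data.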
It really depends on what you mean by significant. Reading between the lines, it seems as though you want to separate 'real' contigs from assembly artefacts. If that's the case, you should think carefully before discarding transcripts with a low RPKM.
There is no minimum - a contig representing a real transcript can have very low numbers of reads mapping to it, and have an extremely low RPKM. Equally, a high RPKM doesn't guarantee that the contig represents a real transcript. We often see chimeric contigs - where fragments from two or more different transcripts have been assembled into one contig. These chimeras often have high RPKM values, even though they are artefacts.
So, the answer to your question 1 is no: a high RPKM does not mean you can be confident in the transcript, so it isn't a reliable indicator. It follows that there is no appropriate RPKM threshold for making such a decision.
As for question 2, it really depends what you want to do with your assembled transcripts. Are you performing differential expression? Motif discovery? Are you interested in a particular set of genes?
@Richard: Understood, thank you. For question 2, yes, I just wanted to focus on a set of sequences that are reliable for downstream analysis such as differential expression. My assembly seems to be quite fragmented, resulting in hundreds of thousands of contigs.
Would it be safe to plot the distribution of all RPKM values (which should be roughly normal) and then say that genes within +/-1 sigma of the mean are "average/moderately" expressed, those above that are highly expressed and those below it are lowly expressed? Overall, you'd have about 68.2% of genes with average expression, 15.9% with low and 15.9% with high expression. It's not really experimental evidence (although you derive the classes from your data), but it's basically a logical assumption. I doubt that adding poly-A spike-ins to your RNA-seq library will give a better conclusion, since those will never behave like mRNAs with all their respective complexity.
Honestly, yes. I don't know if others can confirm this, but I see it in my data. Of course, you have to log-transform the RPKM values, otherwise the dispersion is enormous because of the extreme values. I've also seen it in at least one recent paper, but I just can't find the ref right now.
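For what it's worth, here is a minimal sketch of that +/-1 sigma binning on log-transformed values. The function name, the contig names and the pseudocount of 0.01 (added so that an RPKM of 0 can still be log-transformed) are my own assumptions for illustration, and the whole thing only makes sense if your log2(RPKM) values really are close to normal, which is worth checking with a histogram first.

```python
import math
import statistics

def classify_by_sigma(rpkm_by_contig, pseudocount=0.01):
    """Label each contig 'low', 'moderate' or 'high' from its log2(RPKM).

    'moderate' means within one standard deviation of the mean log2 value.
    The pseudocount is an arbitrary choice that keeps log2 defined at 0.
    """
    logs = {name: math.log2(value + pseudocount)
            for name, value in rpkm_by_contig.items()}
    values = list(logs.values())
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, needs >= 2 values
    labels = {}
    for name, x in logs.items():
        if x < mean - sd:
            labels[name] = "low"
        elif x > mean + sd:
            labels[name] = "high"
        else:
            labels[name] = "moderate"
    return labels

# Made-up RPKM values for hypothetical contigs, just to show the call.
example = {
    "contig_a": 1.0,
    "contig_b": 4.0,
    "contig_c": 4.0,
    "contig_d": 4.0,
    "contig_e": 4.0,
    "contig_f": 4.0,
    "contig_g": 16.0,
}
labels = classify_by_sigma(example)
for name in sorted(labels):
    print(name, labels[name])
```

Keep in mind that these labels only describe relative expression level within one sample; they say nothing about whether a contig is a real transcript.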