I am a biologist and a novice in the analysis of NGS data. I have a set of six transcriptomes. I want to find the expression of the coding genes as well as lncRNAs in each set and then compare them to find co-expressed clusters. For that I need the FPKM of coding genes as well as lncRNAs. I have experience using TopHat + Cufflinks for de novo and RABT assembly and for finding the FPKM of coding genes. But how do I annotate the lncRNAs? In TopHat + Cufflinks mapping and assembly, the genes are assembled based on the supplied GTF file, while novel cases such as novel genes or isoforms of existing genes are found based on novel junctions. Will the lncRNA coordinates also be present in the GTF file?
My transcriptome is not from human. It is from the avian species Taeniopygia guttata (zebra finch). I guess GENCODE only contains data for human and mouse. How can I find the coordinates of lncRNAs in GTF format for this species? If they are not available, I think I will have to adopt a de novo assembly approach. Do you have any suggestions for that?
There may be similar projects or data out there for your species of interest. You would need to check the various genomics resources, or ask people doing genomics on finch, to see if that is the case. If nothing is known about lncRNA in your species, then your transcripts would need to be annotated by homology searching against the closest relative that has lncRNA data.
For de novo assembly, Trinity is quite popular, and there is also a newer program called Sailfish that is supposed to be interesting for isoform abundance estimation. How either deals with ncRNA, though, I am not sure. You would need to investigate to see what they are doing. lncRNAs should definitely fall out of a Trinity assembly since they are long enough.
Thanks for your suggestion. I searched the literature and found that nothing is known about the lncRNAs of my species of interest or of any close relative.
So my plan is to predict the putative lncRNAs. I will use the Cufflinks RABT assembly approach to assemble the known as well as novel transcripts. Then I will check whether the assembled transcript coordinates fall in an exonic, intronic, or intergenic region of the reference genome. Transcripts falling in intronic and intergenic regions may be putative lncRNAs. Next, I will examine the coding potential of the predicted lncRNAs. Do you think this approach is okay for predicting lncRNAs?
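The classification step in this plan can be sketched roughly as below. This is only an illustration in plain Python, not what Cufflinks or any other tool does internally. The function names (overlaps, classify_transcript) and the input layout are assumptions made up for the example: all intervals are taken to be on the same chromosome, 1-based and inclusive as in GTF, and strand is ignored.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based and inclusive as in GTF.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_transcript(
    tx_exons: List[Interval],
    gene_spans: List[Interval],
    annotated_exons: List[Interval],
) -> str:
    """Label an assembled transcript relative to the known annotation.

    Assumes every interval is on the same chromosome; strand is ignored.
    - "exonic":     any transcript exon overlaps an annotated exon
    - "intronic":   no exon overlap, but an exon lies within a gene span
    - "intergenic": no overlap with any annotated gene span
    """
    if any(overlaps(e, x) for e in tx_exons for x in annotated_exons):
        return "exonic"
    if any(overlaps(e, g) for e in tx_exons for g in gene_spans):
        return "intronic"
    return "intergenic"


# Toy annotation: one gene spanning 100-1000 with two exons.
gene_spans = [(100, 1000)]
annotated_exons = [(100, 200), (800, 1000)]

for name, exons in [
    ("tx1", [(150, 300)]),
    ("tx2", [(300, 400)]),
    ("tx3", [(2000, 2500)]),
]:
    print(name, classify_transcript(exons, gene_spans, annotated_exons))
```

On real data the intervals would come from the reference GTF and the assembled GTF, grouped per chromosome, and only the "intronic" and "intergenic" transcripts would be passed on to the coding-potential check.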
Seems reasonable to me. I'm sure there are papers out there from groups doing similar things (predicting lncRNA). I would read through that literature as well to see what approaches and software people are typically using.
Thanks. I have already read some of the literature and found that this is the usual approach, although we may miss some lncRNAs because some are antisense to coding sequences.
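One way to avoid discarding the antisense cases is to compare strands before treating an exon overlap as "known coding". The sketch below is a hedged illustration only: is_antisense and its arguments are hypothetical names, and it assumes the assembled transcript has a strand assigned ("+" or "-"), which in practice usually requires either splice-site information or a strand-specific library.

```python
from typing import List, Tuple

# Hypothetical representation: (start, end), 1-based and inclusive as in GTF.
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_antisense(
    tx_strand: str,
    tx_exons: List[Interval],
    gene_strand: str,
    gene_exons: List[Interval],
) -> bool:
    """Return True if the transcript overlaps the gene's exons on the opposite strand.

    Transcripts or genes with an unknown strand (for example ".") are never
    called antisense, because the orientation cannot be decided.
    """
    known = {"+", "-"}
    if tx_strand not in known or gene_strand not in known:
        return False
    if tx_strand == gene_strand:
        return False
    return any(overlaps(e, x) for e in tx_exons for x in gene_exons)


# Toy example: a minus-strand transcript over a plus-strand coding exon.
print(is_antisense("-", [(150, 300)], "+", [(100, 200)]))
```

Transcripts flagged this way could be kept as lncRNA candidates and sent to the coding-potential check along with the intronic and intergenic ones.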