Hello everyone
I have RNA-seq data from rat and I want to detect lncRNAs. Does anyone know which software or tool I can use to analyse my FASTA file?
Best
Shaima
Hi,
As apelin20 said, I am not sure what format your reads are in. Assuming you have reads in fastq format, use any splice-aware aligner (STAR, TopHat, HISAT2 etc.) to align the reads and provide a GTF file for annotation. Get this GTF from Gencode, which has comprehensive annotation for lncRNAs (as well as protein-coding genes). Once you have run a downstream FPKM quantifier like Cufflinks or StringTie, you would have your detected lncRNAs (along with protein-coding mRNAs).
If somehow you do have a fasta file (though that seems unlikely; why would you do de novo assembly for mouse? Or maybe the fastq was converted to fasta just to save space. Not sure), then get the fasta file for the lncRNAs from here, build a custom BLAST database and use blastn/megablast to find the reads that support lncRNAs.
You could also build a custom index for BWA or Bowtie using the fasta from Gencode and then do the alignment. Alignment would be faster this way, but if your query file is fasta then you would need to tell the aligner to expect fasta input.
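To check whether a GTF actually carries lncRNA records before you run the pipeline, something like the following minimal sketch could help. It assumes the biotype is stored in a `gene_type` attribute (Gencode style) or a `gene_biotype` attribute (Ensembl style), and that the labels in `LNCRNA_BIOTYPES` are the ones your release uses; both are assumptions to verify against your own file. The function names are made up for illustration.

```python
import re

# Biotype labels treated as lncRNA here. This set is an assumption:
# check which labels your annotation release actually uses.
LNCRNA_BIOTYPES = {"lncRNA", "lincRNA", "antisense"}

# Matches key "value" pairs in the 9th GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def lncrna_gene_ids(gtf_lines):
    """Collect gene IDs of 'gene' records whose biotype is in LNCRNA_BIOTYPES."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # only look at gene-level records
        attrs = parse_attributes(cols[8])
        biotype = attrs.get("gene_type") or attrs.get("gene_biotype")
        if biotype in LNCRNA_BIOTYPES and "gene_id" in attrs:
            ids.add(attrs["gene_id"])
    return ids


# Toy records (invented, not real annotation) to show the call.
example = [
    "\t".join(["chr1", "HAVANA", "gene", "100", "900", ".", "+", ".",
               'gene_id "G1"; gene_type "protein_coding"; gene_name "A";']),
    "\t".join(["chr1", "HAVANA", "gene", "2000", "2600", ".", "-", ".",
               'gene_id "G2"; gene_type "lncRNA"; gene_name "B";']),
]
print(sorted(lncrna_gene_ids(example)))
```

For a real file you would pass an open file handle instead of the toy list. If the resulting set is empty, the annotation has no lncRNA records under those labels, and nothing downstream will report them.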
hello Amantin,
I want to detect lncRNAs and find the differentially expressed ones in human (control and treatment) RNA-seq data (Illumina sequencing) in fastq format. In your opinion, how can I compare the control and treatment lncRNAs detected with your workflow? For detecting genes in these data, I use CLC Genomics to get differential expression.
Your attention would be really appreciated.
hi, For better coverage of lncRNAs, total RNA-seq is advisable over polyA-selected RNA-seq. In any case, you should be able to find out which annotation file is being used in the CLC Genomics software. If that annotation file contains lncRNAs (along with protein-coding genes), then any diff-exp lncRNA, if present, would be in the result too. Since CLC is paid s/w, best ask them how to include lncRNAs in the pipeline.
Thank you for your attention. Yes, I asked them too. When I ran "RNA-seq analysis" in CLC Genomics, I received two outputs, GE and TE. I got differential expression for both of them. Is it correct to analyse the TE entries that have fold change > 1.5 and p-value < 0.05 to get the diff-exp lncRNAs? I mean, align only the TE entries that pass the filter?
hi, I haven't used that paid s/w and hence have no idea what GE/TE stand for. Maybe gene & transcript. Yes, those could be valid criteria. But a good threshold also depends on the sample type and the strength of the 'treatment'/biological effect, among other things.
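As a rough illustration of that kind of filter, here is a minimal sketch. It assumes the fold change is exported as a plain ratio (treatment / control) and applies the cutoff in both directions; the dictionary keys and function names are hypothetical and not CLC's actual export format.

```python
import math


def passes_filter(fold_change, p_value, min_fc=1.5, max_p=0.05):
    """True if the change is at least min_fc-fold up or down and p < max_p.

    Assumes fold_change is a positive ratio, not a signed or log2 value.
    """
    if fold_change <= 0:
        return False
    return abs(math.log2(fold_change)) >= math.log2(min_fc) and p_value < max_p


def filter_rows(rows, min_fc=1.5, max_p=0.05):
    """Keep rows (dicts with 'fold_change' and 'p_value') that pass the filter."""
    return [r for r in rows if passes_filter(r["fold_change"], r["p_value"], min_fc, max_p)]


# Invented values, only to show the call.
rows = [
    {"name": "tx1", "fold_change": 2.0, "p_value": 0.01},
    {"name": "tx2", "fold_change": 1.2, "p_value": 0.001},
    {"name": "tx3", "fold_change": 0.4, "p_value": 0.03},
]
print([r["name"] for r in filter_rows(rows)])
```

Working on the log2 scale treats a 2-fold decrease (0.5) the same as a 2-fold increase (2.0), which a plain `fold_change > 1.5` test would miss.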
sorry, you are not making sense. I am not sure what definition you are applying for calling lncRNA. They need not necessarily be in introns of known genes; they can be independent transcriptional units or embedded around known genes. Your s/w, I think, would already have used a splice-aware method to perform the alignment. The lowest hanging fruit would be to exploit known lncRNAs using annotation resources like Gencode. But to do so you should ensure that the GTF was used before the diff. exp. steps happened, i.e. during alignment and transcript quantification. Also, it might be a good idea to find out what the recommended steps for RNA-seq analysis are. This article might be of use.
hello Dear Amantin, I am back again with a number of questions. Last night I used the SpliceSeq software (on Windows) to align reads. Now how can I provide a GTF file? Should I download the human GTF file from Ensembl and filter for my genes, or is there software to do that? I don't know anything about Linux, and finding diff-exp lncRNAs is only part of my thesis (detecting small RNAs, genes, etc. is being done with CLC), so I need Windows software for the lncRNA analysis. Do you have any suggestions?
hi,
I haven't used SpliceSeq, so I can't comment specifically. But I checked the documentation, and there seems to be a module called SpliceSeqDBBuild which generates the db against which the fastq is aligned. It seemed to say that it uses UCSC as its default source, but that there are options to use Ensembl as well. Try to generate the db using Ensembl and then do the steps. If there is any detectable lncRNA signal in the data at all, it should appear in the result files. The software seemed comprehensive, so look around in the documentation some more and you should be able to get some answers.
On a side note, you mentioned small RNAs in the last post. Are you looking for them in this RNA-seq data? Standard RNA-seq data wouldn't be able to capture small RNAs, which is why people do small RNA-seq. I hope you haven't confused one with the other. I would again suggest reading around first before jumping into the analysis. You would do justice to your own time and to others' as well.
Finally, you can try Galaxy. It's online, free and needs only a browser (+ good bandwidth if you want to upload large files). If you want to learn, there is extensive documentation available.
Dear Amitim, I sent an e-mail to the SpliceSeq company about the database. This is the answer:
Thanks for your interest in SpliceSeq. The currently available SpliceSeq reference databases are based on coding transcripts from Ensembl. The transcript records are imported from the UCSC genome database (UCSC has lots of tracks from various sources including Ensembl). I currently do not have a reference database for lncRNA analysis. In principle, this could be constructed from Ensembl transcripts by modifying the current filter that selects for only protein coding transcripts. I think it is likely, though, that some program modifications would be needed, as the current system includes things like protein coding start / stop position that would not be available. I have had requests for a reference DB for lncRNAs and will probably get around to building one in the future but I don’t know if that will be fast enough for you. I can provide the code for the current SpliceSeqDBBuild program if you want to try to modify it yourself but this is likely to be a non-trivial piece of work.
What is your opinion about that?
hi, It's clear that an out-of-the-box solution for lncRNA quantification using SpliceSeq is not available. Depending on how much time you have and how much effort you are ready to put in, you can try these, in order of time required:
1) genomax already pointed out, in your other thread, the use of CLC for lncRNA. In at least one of the papers mentioned there, the Transcript Discovery plugin was used. I have no idea what specifically it does, as most people here use open source s/w. But if you have a CLC license, the best option is to ask customer support.
2) If you do not have a license and are limited to Windows, try GenePattern. It can be installed locally or used online, and has comprehensive RNA-seq tools. The advantage is that you do not need to upload fastq data if you install locally.
3) Galaxy, which I already mentioned in my previous post.
RNA-seq data cannot possibly be fasta... Raw RNA-seq data are reads, usually stored as fastq or bam files. A FASTA file may indicate an attempted de novo assembly. If what you have is assembled transcripts, I would just blastx them all; those returning a hit are likely protein-coding mRNAs, so focus on the rest.
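A minimal sketch of that "focus on the rest" step, assuming you ran blastx with tabular output where the first column is the query ID (the usual `-outfmt 6` layout) and that the FASTA header IDs match those query IDs. The function names and the toy data are illustrative only.

```python
def fasta_ids(fasta_lines):
    """Return sequence IDs: the first whitespace-delimited token after '>'."""
    ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].split()
            if header:
                ids.append(header[0])
    return ids


def ids_without_hits(transcript_ids, blast_tab_lines):
    """Return transcript IDs that never appear as a query in the BLAST table."""
    hit_ids = {
        line.split("\t", 1)[0]
        for line in blast_tab_lines
        if line.strip() and not line.startswith("#")
    }
    return [t for t in transcript_ids if t not in hit_ids]


# Invented example: two assembled transcripts, one with a protein hit.
fasta = [">tx1 len=900", "ACGTACGT", ">tx2 len=450", "GGGTTTAA"]
blast = ["tx1\tsp|P12345|X\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-100\t400"]
print(ids_without_hits(fasta_ids(fasta), blast))
```

Transcripts without a hit are only candidates. Absence of a blastx hit does not prove a transcript is non-coding, so treat the list as a starting point.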
Hello Amitm
I already have the fastq files and FPKM values. Some of the entries have an unknown ID (like -), so I am thinking these could be ncRNAs, but I do not know how to detect them. I am also new to RNA-seq analysis and need to learn more. Shall I use the same steps you mentioned?
I appreciate your answer
Shaima
Hi,
If you gave a comprehensive GTF file (like Gencode), then in the final FPKM file you would have names (with detectable FPKM) for any lncRNA present in your data. I am not sure which program you used for FPKM quantification, but those unknown IDs mean that the transcript structure (with the unknown ID) didn't match any of the transcripts in the GTF file provided. In the case of human, where rich annotation information is available, you could be fairly sure that most lncRNAs would already be represented and that unknown IDs would mostly be due to transcriptional noise coupled with low read depth.
If you want to investigate further then I suggest you do this. Take those unknown IDs and make a BED file of their coordinates. Then, using the BED of known protein-coding genes, do a subtraction (use bedtools). That way you would filter out possible unresolved isoforms of protein-coding genes and focus only on candidates that do not overlap such genomic regions. Then maybe do another subtraction with the BED of known lncRNAs. That way you ensure that you are now looking at candidates that are in no way known mRNAs or known lncRNAs.
Once you have the filtered BED you can do lots of things to explore, but basically you would be on your own from here, as this becomes predicting novel lncRNAs. I am not sure if this is your goal. If it is, then the lowest hanging fruit is to prioritize the candidates based on FPKM. Then you could use the ENCODE resource to look for promoter/histone marks around your candidates, etc.
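To make the subtraction logic concrete, here is a small pure-Python sketch that drops every candidate interval overlapping any known interval on the same chromosome. It assumes standard BED coordinates (0-based start, end exclusive), ignores strand, and removes whole candidates rather than trimming them; that is comparable in spirit to `bedtools intersect -v`, though bedtools is what you would use on real data. Function names and coordinates are invented for illustration.

```python
from collections import defaultdict


def parse_bed(lines):
    """Parse BED lines into (chrom, start, end, name) tuples."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines and header lines
        chrom, start, end, *rest = line.rstrip("\n").split("\t")
        records.append((chrom, int(start), int(end), rest[0] if rest else ""))
    return records


def remove_overlapping(candidates, known):
    """Keep candidates that overlap no known interval on the same chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end, _ in known:
        by_chrom[chrom].append((start, end))
    kept = []
    for chrom, start, end, name in candidates:
        # Half-open intervals overlap when each starts before the other ends.
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end in by_chrom[chrom])
        if not overlaps:
            kept.append((chrom, start, end, name))
    return kept


# Invented coordinates, only to show the two-step filtering.
unknown = parse_bed(["chr1\t100\t500\tu1", "chr1\t5000\t5600\tu2", "chr2\t10\t90\tu3"])
coding = parse_bed(["chr1\t400\t2000\tgeneA"])
lnc = parse_bed(["chr2\t50\t300\tlncB"])
novel = remove_overlapping(remove_overlapping(unknown, coding), lnc)
print(novel)
```

Because strand is ignored, candidates antisense to a coding gene are also discarded. Whether that is what you want depends on your definition of a candidate.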
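Prioritizing by FPKM could look like the sketch below. It assumes you have one FPKM value per sample for each candidate and ranks by the mean; the `min_fpkm=1.0` cutoff is an arbitrary placeholder, not a recommended value, and the names are made up.

```python
def rank_by_fpkm(fpkm_by_id, min_fpkm=1.0):
    """Return (candidate_id, mean_fpkm) pairs, highest mean first.

    Candidates with a mean below min_fpkm (an arbitrary placeholder) are dropped.
    """
    means = {
        cid: sum(values) / len(values)
        for cid, values in fpkm_by_id.items()
        if values
    }
    ranked = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return [(cid, mean) for cid, mean in ranked if mean >= min_fpkm]


# Invented values, only to show the call.
candidates = {"u2": [4.0, 6.0], "u7": [0.2, 0.4], "u9": [12.0, 8.0]}
print(rank_by_fpkm(candidates))
```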