Question

How can I detect lncRNA?

2

8.9 years ago

saj98 ▴ 140

Hello everyone

I have RNA-seq data from rat and I want to detect lncRNAs. Does anyone know which software or tool I can use to analyse my FASTA file?

Best
Shaima

RNA-Seq • 5.4k views

ADD COMMENT • link updated 2.4 years ago by Ram 44k • written 8.9 years ago by saj98 ▴ 140

1

RNA-seq data cannot possibly be FASTA. Raw RNA-seq data are reads, usually stored as FASTQ or BAM files. FASTA files may indicate an attempted de novo assembly. If what you have is assembled transcripts, I would just blastx them all: those returning a hit are likely protein-coding mRNAs, so focus on the rest.
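As a rough illustration of the blastx triage described above, here is a minimal Python sketch. It assumes the blastx results were saved in the default tabular layout (`-outfmt 6`, where the first column is the query ID and the eleventh is the e-value). The function name, the transcript IDs and the `1e-5` cutoff are hypothetical choices for the example, not recommendations.

```python
import csv
import io

def split_by_blastx_hits(transcript_ids, blast_tabular, max_evalue=1e-5):
    """Split IDs into (likely_coding, candidates) from blastx outfmt 6 text."""
    with_hit = set()
    for row in csv.reader(io.StringIO(blast_tabular), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query_id, evalue = row[0], float(row[10])
        if evalue <= max_evalue:
            with_hit.add(query_id)
    likely_coding = [t for t in transcript_ids if t in with_hit]
    candidates = [t for t in transcript_ids if t not in with_hit]
    return likely_coding, candidates

# Made-up example: tx1 has a strong protein hit, tx3 only a weak one, tx2 none.
example = (
    "tx1\tsp|P1\t98.5\t200\t3\t0\t1\t600\t1\t200\t1e-100\t400\n"
    "tx3\tsp|P2\t30.0\t50\t35\t0\t1\t150\t1\t50\t2.5\t25.0\n"
)
coding, candidates = split_by_blastx_hits(["tx1", "tx2", "tx3"], example)
print(coding)      # ['tx1']
print(candidates)  # ['tx2', 'tx3']
```

Transcripts without a protein hit are only candidates: a missing hit does not prove a transcript is non-coding.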

ADD REPLY • link 8.9 years ago by apelin20 ▴ 480

0

Hello Amitm

I already have FASTQ files and FPKM values. Some of the entries have an unknown ID (shown as -), so I think these could be ncRNAs, but I do not know how to detect them. I am also new to RNA-seq analysis and need to learn more. Should I use the same steps you mentioned?

I appreciate your answer

Shaima

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by saj98 ▴ 140

1

Hi,

If you gave a comprehensive GTF file (like Gencode), then the final FPKM file would list names (with detectable FPKM) for any annotated lncRNA present in your data. I am not sure which program you used for FPKM quantification, but those unknown IDs mean that the transcript structure didn't match any of the transcripts in the GTF file provided. In the case of human, where rich annotation is available, you could be fairly sure that most lncRNAs are already represented and that unknown IDs are mostly due to transcriptional noise coupled with low read depth.

If you want to investigate further, I suggest the following. Take those unknown IDs and make a BED file of their coordinates. Then, using a BED file of known protein-coding genes, do a subtraction (use bedtools). That way you filter out possible unresolved isoforms of protein-coding genes and focus only on candidates that do not overlap such genomic regions. Then maybe do another subtraction with a BED file of known lncRNAs. That way you ensure that the remaining candidates are neither known mRNAs nor known lncRNAs.
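The two-step filtering above can be sketched in plain Python to make the logic explicit. This is only an illustration of the idea (roughly what `bedtools intersect -v` does, dropping any candidate that overlaps a known feature). It ignores strand and uses a simple all-against-all comparison, so for real data bedtools itself is the better choice. All coordinates and names below are invented.

```python
from collections import defaultdict

def parse_bed(text):
    """Parse BED text into (chrom, start, end, name) tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        name = fields[3] if len(fields) > 3 else "."
        records.append((fields[0], int(fields[1]), int(fields[2]), name))
    return records

def remove_overlapping(candidates, known):
    """Keep only candidates that overlap no known interval (strand ignored)."""
    by_chrom = defaultdict(list)
    for chrom, start, end, _ in known:
        by_chrom[chrom].append((start, end))
    kept = []
    for chrom, start, end, name in candidates:
        # BED is 0-based, half-open: two intervals overlap when each one
        # starts before the other one ends.
        overlaps = any(start < k_end and k_start < end
                       for k_start, k_end in by_chrom.get(chrom, []))
        if not overlaps:
            kept.append((chrom, start, end, name))
    return kept

# Hypothetical inputs: unknown-ID transcripts, known coding genes, known lncRNAs.
unknown = "chr1\t100\t500\tXLOC_1\nchr1\t1000\t1500\tXLOC_2\nchr2\t50\t300\tXLOC_3\n"
coding = "chr1\t400\t900\tGeneA\n"
known_lnc = "chr2\t250\t800\tLncB\n"

not_coding = remove_overlapping(parse_bed(unknown), parse_bed(coding))
novel = remove_overlapping(not_coding, parse_bed(known_lnc))
print([name for _, _, _, name in novel])  # ['XLOC_2']
```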

Once you have the filtered BED file you can explore in many ways, but basically you would be on your own, as this becomes prediction of novel lncRNAs. I am not sure if this is your goal. If it is, the lowest-hanging fruit is to prioritize the candidates by FPKM. Then you could use the ENCODE resource to look for promoter or histone marks around your candidates, and so on.
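One possible way to prioritize by FPKM, sketched in Python. The thresholds (FPKM of at least 1 in at least two samples) and the candidate IDs are assumptions made up for illustration; sensible cutoffs depend on sequencing depth and the number of replicates.

```python
from statistics import mean

def prioritize_candidates(fpkm_table, min_fpkm=1.0, min_samples=2):
    """Rank candidates by mean FPKM, keeping those expressed in enough samples."""
    ranked = []
    for cand_id, values in fpkm_table.items():
        n_expressed = sum(v >= min_fpkm for v in values)
        if n_expressed >= min_samples:
            ranked.append((cand_id, mean(values)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Hypothetical FPKM values for three candidates across three samples.
table = {
    "XLOC_1": [0.2, 0.1, 0.0],
    "XLOC_2": [5.0, 7.0, 6.0],
    "XLOC_3": [1.5, 0.4, 2.5],
}
ranking = prioritize_candidates(table)
print([cand_id for cand_id, _ in ranking])  # ['XLOC_2', 'XLOC_3']
```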

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by Amitm ★ 2.3k

Ram · Answer 1 · 2016-01-07

1

8.9 years ago

Amitm ★ 2.3k

Hi,

As apelin20 said, I am not sure what format your reads are in. Assuming you have reads in FASTQ format, use any splice-aware aligner (STAR, TopHat, HISAT2, etc.) to align the reads, and provide a GTF file for annotation. Get this GTF from Gencode, which has comprehensive annotation for both protein-coding genes and lncRNAs. Once you have run a downstream FPKM quantifier like Cufflinks or StringTie, you would have your detected lncRNAs (along with protein-coding mRNAs).
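Once quantification is done, picking the lncRNAs out of the results comes down to knowing which gene IDs the annotation labels as lncRNA. A minimal Python sketch follows, assuming a GENCODE-style GTF where gene records carry a `gene_type` attribute (Ensembl GTFs use `gene_biotype` instead). The biotype labels passed in are an assumption: older releases use finer categories such as `lincRNA`, so check the labels in your own file. The GTF lines and FPKM values are toy examples.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')  # matches: key "value"

def lncrna_gene_ids(gtf_text, biotypes=("lncRNA", "lincRNA")):
    """Collect gene IDs whose biotype attribute is in `biotypes`."""
    ids = set()
    for line in gtf_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        attrs = dict(ATTR_RE.findall(fields[8]))
        biotype = attrs.get("gene_type") or attrs.get("gene_biotype")
        if biotype in biotypes:
            ids.add(attrs["gene_id"])
    return ids

# Two toy gene records in GENCODE-like layout.
gtf = (
    '##description: toy example\n'
    'chr1\tHAVANA\tgene\t29554\t31109\t.\t+\t.\t'
    'gene_id "ENSG00000243485.5"; gene_type "lncRNA"; gene_name "MIR1302-2HG"; level 2;\n'
    'chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\t'
    'gene_id "ENSG00000186092.7"; gene_type "protein_coding"; gene_name "OR4F5"; level 2;\n'
)
lnc_ids = lncrna_gene_ids(gtf)

# Hypothetical per-gene FPKM values from a quantifier.
fpkm = {"ENSG00000243485.5": 3.2, "ENSG00000186092.7": 12.0}
detected_lnc = {g: v for g, v in fpkm.items() if g in lnc_ids and v > 0}
print(sorted(detected_lnc))  # ['ENSG00000243485.5']
```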

If somehow you have a FASTA file (though it is unclear why you would do de novo assembly for mouse; or maybe the FASTQ was converted to FASTA just to save space, not sure), then get the FASTA file for the lncRNAs from here, build a custom BLAST database, and use blastn/megablast to find the reads that support lncRNAs.

You could also build a custom index for BWA or Bowtie using the FASTA from Gencode and then do the alignment. Alignment would be faster this way, but if your query file is FASTA you would need to tell the aligner to expect FASTA input.
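To make the last point concrete, here is a small Python sketch that guesses the read format from the first line of the file and assembles a Bowtie 2 command accordingly. Bowtie 2 expects FASTQ by default and takes `-f` for FASTA queries. The index prefix `lncrna_index` (which would come from `bowtie2-build`) and the file names are hypothetical. The sketch only builds the argument list and does not run the aligner.

```python
def sniff_read_format(first_line):
    """Guess FASTA vs FASTQ from the first character of the first record."""
    if first_line.startswith(">"):
        return "fasta"
    if first_line.startswith("@"):
        return "fastq"
    raise ValueError("first line does not look like FASTA or FASTQ")

def bowtie2_command(index_prefix, reads_path, first_line, out_sam="aligned.sam"):
    """Build a single-end bowtie2 argument list, adding -f for FASTA input."""
    cmd = ["bowtie2", "-x", index_prefix, "-U", reads_path, "-S", out_sam]
    if sniff_read_format(first_line) == "fasta":
        cmd.insert(1, "-f")  # tell bowtie2 the query reads are FASTA
    return cmd

print(" ".join(bowtie2_command("lncrna_index", "reads.fa", ">read_1")))
# bowtie2 -f -x lncrna_index -U reads.fa -S aligned.sam
```

The resulting list could be passed to `subprocess.run` on a machine where Bowtie 2 is installed.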

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by Amitm ★ 2.3k

0

hello Amantin,

I want to detect lncRNAs and get their differential expression from some human (control and treatment) RNA-seq data (Illumina sequencing data) in FASTQ format. In your opinion, how can I compare the control and treatment lncRNAs detected with your workflow? For detecting genes in these data, I use CLC Genomics to get differential expression.

Your attention would be really appreciated.

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

0

hi, For better coverage of lncRNAs, total RNA-seq is advisable over polyA-selected RNA-seq. In any case, you should be able to find out which annotation file is being used in the CLC Genomics software. If that annotation file contains lncRNAs (along with protein-coding genes), then any differentially expressed lncRNA, if present, would be in the results too. Since CLC is paid software, it is best to ask them how to include lncRNAs in the pipeline.

ADD REPLY • link 8.4 years ago by Amitm ★ 2.3k

0

Thank you for your attention. Yes, I asked them too. When I ran "RNA-seq analysis" in CLC Genomics, I received two outputs, GE and TE, and I got differential expression for both. Is it correct to analyze the TE entries with fold change > 1.5 and p-value < 0.05 to get differentially expressed lncRNAs? I mean, should I align only the TE entries that pass the filter?

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

0

hi, I haven't used that paid software and hence have no idea what GE and TE stand for. Maybe gene and transcript. Yes, that could be a valid criterion. But settling on a good threshold also depends on sample type and the strength of the treatment or biological effect, among other things.
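For reference, the fold change and p-value filter being discussed could look like the Python sketch below. It assumes the fold change is a positive treatment/control ratio, so that 1.5-fold up and 1.5-fold down are treated symmetrically on the log2 scale. If the software reports signed fold changes, the check would need adapting. With many transcripts tested, a multiple-testing-adjusted p-value is usually the safer column to filter on. The transcript names and numbers are invented.

```python
import math

def is_differential(fold_change, p_value, min_fold=1.5, max_p=0.05):
    """True if the change exceeds min_fold in either direction and p < max_p."""
    if fold_change <= 0:
        return False  # a ratio must be positive to take its logarithm
    return abs(math.log2(fold_change)) > math.log2(min_fold) and p_value < max_p

# Hypothetical transcript-level results: (ID, fold change, p-value).
results = [
    ("TE_1", 2.0, 0.01),   # up, significant
    ("TE_2", 0.4, 0.001),  # down, significant
    ("TE_3", 1.2, 0.001),  # change too small
    ("TE_4", 3.0, 0.20),   # not significant
]
passing = [tid for tid, fc, p in results if is_differential(fc, p)]
print(passing)  # ['TE_1', 'TE_2']
```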

ADD REPLY • link 8.4 years ago by Amitm ★ 2.3k

0

Yes, GE means gene and TE means transcript. First I should detect the introns of the filtered TE entries with a splicing tool, I mean define the gaps between the transcripts of genes and get the intron sequences. But what should I do after that? Get a GTF from Gencode?

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

1

Sorry, you are not making sense. I am not sure what definition you are applying for calling lncRNAs. They need not be in introns of known genes; they can be independent transcriptional units or embedded around known genes. Your software, I think, would already have used a splice-aware method to perform the alignment. The lowest-hanging fruit would be to exploit known lncRNAs using annotation resources like Gencode. But to do so you should ensure that the GTF was used before the differential expression steps, i.e. during alignment and transcript quantification. Also, it might be a good idea to find out what the recommended steps for RNA-seq analysis are. This article might be of use.

ADD REPLY • link 8.4 years ago by Amitm ★ 2.3k

0

Thank you for your attention. You are right about lncRNAs; I was confused.

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

0

Hello dear Amantin, I am back with a few more questions. Last night I used the SpliceSeq software (on Windows) to align reads. How can I now provide a GTF file? Should I download the human GTF file from Ensembl and filter for my genes, or is there software to do that? I don't know anything about Linux, and finding differentially expressed lncRNAs is only part of my thesis (I am detecting small RNAs, genes, etc. with CLC). I need Windows software for lncRNA analysis; do you have any suggestions?

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

1

hi, I haven't used SpliceSeq, so I can't comment specifically. But I checked the documentation, and there seems to be a module called SpliceSeqDBBuild which generates the database against which the FASTQ reads are aligned. It seemed to say that it uses UCSC as its default source, with options to use Ensembl as well. Try to generate the database using Ensembl and then run the steps. If there is any detectable lncRNA signal in the data, it should appear in the result files. The software seemed comprehensive, so look through the documentation some more and you should be able to get some answers.

On a side note, you mentioned small RNAs in your last post. Are you looking for them in this RNA-seq data? Standard RNA-seq data would not capture small RNAs, which is why people do small RNA-seq. I hope you haven't confused one with the other. I would again suggest reading around first before jumping into analysis. That would do justice to your time and to others' as well.

Finally, you can try Galaxy. It's online, free, and needs only a browser (plus good bandwidth if you want to upload large files). If you want to learn, extensive documentation is available.

ADD REPLY • link 8.4 years ago by Amitm ★ 2.3k

0

Thanks for your help and kindness. Yes, I detect small RNAs from small RNA-seq data.

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

0

Dear Amitim, I sent an e-mail to the SpliceSeq team about the database. This is the answer:

Thanks for your interest in SpliceSeq. The currently available SpliceSeq reference databases are based on coding transcripts from Ensembl. The transcript records are imported from the UCSC genome database (UCSC has lots of tracks from various sources including Ensembl). I currently do not have a reference database for lncRNA analysis. In principle, this could be constructed from Ensembl transcripts by modifying the current filter that selects for only protein coding transcripts. I think it is likely, though, that some program modifications would be needed, as the current system includes things like protein coding start / stop position that would not be available. I have had requests for a reference DB for lncRNAs and will probably get around to building one in the future but I don't know if that will be fast enough for you. I can provide the code for the current SpliceSeqDBBuild program if you want to try to modify it yourself but this is likely to be a non-trivial piece of work.

What is your opinion about that?

ADD REPLY • link 8.4 years ago by Edalat ▴ 30

1

hi, It's clear that an out-of-the-box solution for lncRNA quantification using SpliceSeq is not available. Depending on how much time you have and how much effort you are ready to put in, you can try these, in order of time required:

1) genomax already pointed out in your other thread the use of CLC for lncRNA analysis. In at least one of the papers mentioned there, the Transcript Discovery plugin was used. I have no idea what specifically it does, as most people here use open-source software. But if you have a CLC license, the best option is to ask customer support.

2) If you do not have a license and are limited to Windows, try GenePattern. It can be installed locally or used online, and has comprehensive RNA-seq tools. The advantage is that you do not need to upload FASTQ data if you install locally.

3) I have already mentioned Galaxy in my previous post.

ADD REPLY • link 8.4 years ago by Amitm ★ 2.3k