Question

Non-Annotated Long Noncoding RNA Candidates

3

13.2 years ago

Stephanhart ▴ 100

I am a biologist and a poor programmer. I have RNA-seq data from which I'd like to extract all potential long ncRNA candidates. I assume the easiest way would be to compare my data to RefSeq or Rfam, but I don't know how to do this. Furthermore, this strategy would only pick up already annotated lncRNAs. I would be glad of any help.

rna scripting • 7.3k views

ADD COMMENT • link updated 10.1 years ago by angeloulivieri • 0 • written 13.2 years ago by Stephanhart ▴ 100

1

What type of RNA-seq data, precisely, do you have?

ADD REPLY • link 13.2 years ago by Neilfws 49k

Ram · Answer 1 · 2011-09-28

Here is a coarse overview of the steps I regard as essential for this kind of analysis. The best way to do this depends on many factors that you didn't specify: the type of data (Illumina, SOLiD, 454; paired-end or single-end), read length, preferred programming language or toolbox (e.g. R, Perl, Galaxy), the grouping of your data into samples, the number of replicates, the alignment tools, etc.

For this you will need:

The full genome sequence of your organism (as a FASTA file)
All annotated regions of the genome, including known annotated ncRNAs (ideally as a GFF file)
Your reads

With that:

Filter reads
Align reads against the whole reference genome
Find regions of high coverage (this is the hard part; the question is how to define "high", e.g. by a fixed cutoff or by significance using replicates). Edit: One idea for calibrating the required coverage cutoff is to look at the coverage of known RNA genes.
Remove regions that overlap with annotated exons (or, in addition, with known RNA genes). Edit: This step could be optional for the cases GWW mentions, but skipping it will most likely yield almost all transcripts.
Keep regions that are in introns or intergenic regions of the genome and have a suitable distance to coding regions.
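
As a rough illustration of the coverage step, here is a minimal Python sketch. It assumes you already have per-base read depth for one chromosome as a plain list (for example, exported from your alignments). The function names, the `fraction` and `min_length` parameters, and the toy numbers are all made up for illustration; this is one simple cutoff-based approach, not a statistically rigorous one.

```python
from statistics import median


def calibrate_cutoff(coverage, known_rna_genes, fraction=0.5):
    """Suggest a depth cutoff as a fraction of the median per-gene mean
    coverage of known RNA genes (intervals are 0-based, half-open).
    The 0.5 default is an arbitrary illustrative choice."""
    means = []
    for start, end in known_rna_genes:
        window = coverage[start:end]
        if window:
            means.append(sum(window) / len(window))
    if not means:
        raise ValueError("no known RNA gene overlaps the coverage track")
    return fraction * median(means)


def high_coverage_regions(coverage, cutoff, min_length=200):
    """Return (start, end) half-open intervals in which depth stays at or
    above the cutoff for at least min_length consecutive bases."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= cutoff:
            if start is None:
                start = pos  # a new covered run begins here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(coverage) - start >= min_length:
        regions.append((start, len(coverage)))
    return regions


# Toy per-base depth for a 25-base "chromosome" (made-up numbers).
coverage = [0] * 5 + [10] * 6 + [0] * 4 + [8] * 3 + [1] * 2 + [12] * 5
known_rna_genes = [(5, 11)]

cutoff = calibrate_cutoff(coverage, known_rna_genes)
regions = high_coverage_regions(coverage, cutoff, min_length=4)
print(cutoff, regions)
```

On real data you would run this per chromosome and per strand, and with replicates you would probably prefer a significance-based criterion over a single fixed cutoff.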

Some steps of the pipeline and possible follow up analyses are also outlined here: Identified Potential Non-Coding Rna, And Then?

Also have a look at the rna-seq questions here: https://www.biostars.org/t/rna-seq

That way you will only get regions that are new, and far enough away from exons.

This is not too difficult to code in, for example, R or a local Galaxy install, but I am not sure how much this helps; it depends on which of these steps you can perform yourself. So I would suggest that you get some local support from a (bio)informatics person on-site.

Ram · Answer 2 · 2011-09-29

1

13.2 years ago

Eric Fournier ★ 1.4k

As I noted in "location and expression of lincRNA", you should read Cabili et al., 2011, "Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses", for an in-depth example of how to find and annotate lncRNAs from RNA-Seq data.

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.2 years ago by Eric Fournier ★ 1.4k

0

Good, finally a paper on this. I doubt, however, whether the transcriptome assembly step is really necessary.

ADD REPLY • link 13.2 years ago by Michael 55k

Ram · Answer 3 · 2011-09-28

0

13.2 years ago

Ido Tamir 5.2k

You have a lot to do:

1. Align to the genome
2. Assemble the reads into transcripts
3. Compare with known protein-coding or non-coding annotations
4. More advanced identification methods (not really answered there).

There are many ways to do all this, but as a starter you could follow the Scripture guide or the TopHat + Cufflinks tutorial for points 1-2.

Then you end up with some chromosomal regions showing transcriptional activity that you could compare to the known annotations (e.g. overlap) with bedtools.
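
To make the overlap comparison concrete, here is a small Python sketch of the underlying logic: keep only candidate regions that do not overlap (or come too close to) any annotated feature. On real BED files, `bedtools intersect -v` reports the entries of one file that have no overlap in the other, which is the practical route; the code below only illustrates the idea. The function names, the `min_distance` parameter, and the coordinates are hypothetical, and strand is ignored for simplicity.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based, half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def novel_regions(candidates, annotated, min_distance=0):
    """Keep candidate (chrom, start, end) regions that do not come within
    min_distance bases of any annotated (chrom, start, end) feature."""
    by_chrom = {}
    for chrom, start, end in annotated:
        by_chrom.setdefault(chrom, []).append((start, end))

    kept = []
    for chrom, start, end in candidates:
        # Pad the candidate on both sides, then test for plain overlap.
        too_close = any(
            overlaps(start - min_distance, end + min_distance, a_start, a_end)
            for a_start, a_end in by_chrom.get(chrom, [])
        )
        if not too_close:
            kept.append((chrom, start, end))
    return kept


# Made-up annotation and candidate regions.
annotated_exons = [("chr1", 100, 200), ("chr1", 1000, 1200)]
candidates = [
    ("chr1", 150, 400),  # overlaps the first exon
    ("chr1", 500, 800),  # intergenic, far from both exons
    ("chr1", 900, 990),  # intergenic, but only 10 bases from the second exon
    ("chr2", 100, 200),  # chromosome without any annotation
]

strict = novel_regions(candidates, annotated_exons)
padded = novel_regions(candidates, annotated_exons, min_distance=50)
print(strict)
print(padded)
```

The linear scan over all features is fine for a toy example; for genome-scale annotations a sorted or indexed lookup (which is what bedtools provides) is the better choice.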

If you have done all this, I would ask again about point 4 (and of course there are other ways to do this, e.g. reference-free assembly of the RNA...).

ADD COMMENT • link updated 5.3 years ago by Ram 44k • written 13.2 years ago by Ido Tamir 5.2k

0

The sequence is already assembled and I have all the data in BAM files. So it's really only steps 3 and 4 I'm interested in. I would be grateful if you could expound on them somewhat.

ADD REPLY • link 13.2 years ago by Stephanhart ▴ 100

0

What is stored in the BAM files? At the end of step 2 you should end up with a transcript model that gives you locations for transcripts, not locations for reads. Most of the time these are BED, GTF, or similar files. Did you convert these to BAM files? If you did indeed do an assembly with Cufflinks, you could try comparing it to a known annotation with cuffcompare. If you have a BED file with locations, try bedtools to overlap it with the known annotation. You can also extract the sequence of your assembled exons and BLAST it against a transcript database (RefSeq, known-genes).
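
As a sketch of what could follow a cuffcompare run: its combined GTF output tags each transcript with a `class_code` attribute, where, per the Cufflinks documentation, "u" marks unknown intergenic transcripts and "i" marks transcripts falling entirely within a reference intron. The Python below keeps transcripts with those codes whose summed exon length reaches 200 nt, the conventional lower bound for calling an RNA "long". The function names and the toy records are hypothetical, and the attribute layout should be checked against your own output file.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)"')


def candidate_transcripts(gtf_lines, keep_codes=("u", "i"), min_length=200):
    """Return sorted transcript IDs whose class_code is in keep_codes and
    whose exons sum to at least min_length bases."""
    lengths = defaultdict(int)
    codes = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTRIBUTE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        lengths[tid] += int(fields[4]) - int(fields[3]) + 1
        if "class_code" in attrs:
            codes[tid] = attrs["class_code"]
    return sorted(
        tid for tid, total in lengths.items()
        if codes.get(tid) in keep_codes and total >= min_length
    )


def exon(tid, start, end, code):
    """Build one toy, tab-separated GTF exon line (hypothetical IDs)."""
    attrs = f'gene_id "XLOC_1"; transcript_id "{tid}"; class_code "{code}";'
    return "\t".join(
        ["chr1", "Cufflinks", "exon", str(start), str(end), ".", "+", ".", attrs]
    )


toy_gtf = [
    exon("TCONS_1", 100, 250, "u"), exon("TCONS_1", 400, 500, "u"),  # 252 nt, intergenic
    exon("TCONS_2", 1000, 1100, "u"),                                # 101 nt, too short
    exon("TCONS_3", 5000, 5600, "="),                                # matches a known transcript
    exon("TCONS_4", 7000, 7300, "i"),                                # 301 nt, intronic
]

candidates = candidate_transcripts(toy_gtf)
print(candidates)
```

For a real file you could pass an open file handle instead of the toy list, since the function only iterates over lines. Passing this filter only means "unannotated and long enough"; assessing coding potential (point 4) is a separate step.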

ADD REPLY • link 13.2 years ago by Ido Tamir 5.2k

0

I did an assembly with Cufflinks.

ADD REPLY • link 13.2 years ago by Stephanhart ▴ 100

0

So how was the cuffcompare comparison? Did it give you useful output? I don't think you need a command-line example for point 3 if you came that far. For point 4, just follow the links in the answers that I linked to.

ADD REPLY • link 13.2 years ago by Ido Tamir 5.2k

score 0 · Answer 4 · 2014-12-02

0

10.1 years ago

angeloulivieri • 0

You could use Annocript. It annotates transcriptomes and identifies putative long non-coding transcripts. You only need to give it the complete transcriptome as a FASTA file. If you have raw reads, you need to assemble them first with software such as Trinity.

ADD COMMENT • link 10.1 years ago by angeloulivieri • 0