Non-Annotated Long Noncoding RNA Candidates
13.3 years ago
Stephanhart ▴ 100

I am a biologist and a poor programmer. I have RNA-seq data from which I'd like to extract all potential long ncRNA candidates. I assume the easiest way would be to compare my data to RefSeq or Rfam, but I don't know how to do this. Furthermore, this strategy would only pick up already annotated lncRNAs. I would be glad of any help.

rna scripting • 7.3k views

What type of RNA-seq data, precisely, do you have?

13.3 years ago
Michael 55k

Here is a coarse overview of the steps I regard as essential for this kind of analysis. The best way to carry them out depends on many factors that you didn't specify: the type of data (Illumina, SOLiD, 454; paired-end or single-end), read length, your preferred programming language or toolbox (e.g. R, Perl, Galaxy), how your data are grouped into samples, the number of replicates, the alignment tools, etc.

For this you will need:

  • The full genome sequence of your organism (as a FASTA file)
  • All annotated regions of the genome, including known annotated ncRNAs (ideally as a GFF file)
  • Your reads

With that:

  • Filter reads
  • Align reads against the whole reference genome
  • Find regions of high coverage (this is the hard part; the question is how to define it, e.g. by a cutoff or by significance using replicates). Edit: One idea for calibrating the required coverage cutoff is to look at the coverage of known RNA genes.

  • Remove regions that overlap with annotated exons (or in addition with known RNA genes) Edit: This step could be optional, for the cases GWW mentions, but that will most likely yield almost all transcripts.

  • Keep regions that are in introns or intergenic regions of the genome and have suitable distance to coding regions.
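The last three steps can be sketched in a few lines of Python. This is only a minimal illustration under several assumptions that are not from the answer above: coverage is a per-base list for a single chromosome, intervals are 0-based and half-open, strand is ignored, and the names `covered_regions`, `far_from_exons`, `candidate_regions`, `cutoff`, `min_len` and `min_dist` are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def covered_regions(coverage: List[int], cutoff: int, min_len: int = 1) -> List[Interval]:
    """Return runs of consecutive positions whose depth is >= cutoff."""
    regions = []
    start = None
    for pos, depth in enumerate(coverage):
        if depth >= cutoff:
            if start is None:
                start = pos  # a new run begins here
        elif start is not None:
            if pos - start >= min_len:
                regions.append((start, pos))
            start = None
    # Close a run that extends to the end of the chromosome.
    if start is not None and len(coverage) - start >= min_len:
        regions.append((start, len(coverage)))
    return regions


def far_from_exons(region: Interval, exons: List[Interval], min_dist: int) -> bool:
    """True if the region keeps a gap of at least min_dist bases from every exon."""
    r_start, r_end = region
    for e_start, e_end in exons:
        # Overlap test against the exon padded by min_dist on both sides.
        if r_start < e_end + min_dist and e_start - min_dist < r_end:
            return False
    return True


def candidate_regions(coverage: List[int], exons: List[Interval],
                      cutoff: int, min_len: int, min_dist: int) -> List[Interval]:
    """High-coverage regions that neither overlap nor lie close to annotated exons."""
    return [r for r in covered_regions(coverage, cutoff, min_len)
            if far_from_exons(r, exons, min_dist)]


coverage = [0, 0, 5, 6, 7, 0, 0, 0, 9, 9, 9, 9, 0, 0, 0, 0, 8, 8, 8, 0]
exons = [(8, 12)]
print(candidate_regions(coverage, exons, cutoff=5, min_len=3, min_dist=2))
```

In practice the coverage would come from the aligned reads and the exons from the GFF file; a fixed cutoff is the simplest of the options mentioned above and does not use replicates.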

Some steps of the pipeline and possible follow up analyses are also outlined here: Identified Potential Non-Coding Rna, And Then?

Also have a look at the rna-seq questions here: https://www.biostars.org/t/rna-seq

That way you will only get regions that are new, and far enough away from exons.

This is not too difficult to code in, for example, R or a local Galaxy install, but I am not sure how much this helps; it depends on which of these steps you can perform yourself. I would therefore suggest getting some local support from a (bio)informatics person on-site.


There are a few issues with your workflow. A lot of long non-coding RNAs are not simply intergenic: they can be intronic, overlap another gene, or be antisense to a known gene. Those are much harder to identify with normal RNA-seq workflows.
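To make these categories concrete, here is a rough sketch of how a transcript could be labelled by its position relative to annotated genes. It assumes the strand of the transcript is known (i.e. a stranded library), uses 0-based half-open coordinates on one chromosome, and the `Gene` tuple and `classify` function are made-up names for illustration only.

```python
from typing import List, NamedTuple, Set, Tuple


class Gene(NamedTuple):
    start: int
    end: int
    strand: str                    # "+" or "-"
    exons: List[Tuple[int, int]]   # 0-based, half-open


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def classify(t_start: int, t_end: int, t_strand: str, genes: List[Gene]) -> Set[str]:
    """Label a transcript relative to every gene whose span it overlaps."""
    labels = set()
    for g in genes:
        if not overlaps(t_start, t_end, g.start, g.end):
            continue
        if t_strand != g.strand:
            labels.add("antisense")
        elif any(overlaps(t_start, t_end, s, e) for s, e in g.exons):
            labels.add("exon-overlapping")
        else:
            labels.add("intronic")
    # No overlapping gene at all: the transcript is intergenic.
    return labels or {"intergenic"}


genes = [Gene(100, 500, "+", [(100, 200), (400, 500)])]
for span in [(600, 700, "+"), (250, 350, "+"), (150, 250, "+"), (150, 250, "-")]:
    print(span, sorted(classify(*span, genes)))
```

With unstranded data the antisense case cannot be told apart from a same-strand overlap, which is part of why these candidates are hard to recover.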


This is very true, and I know this, but I think it is very hard or impossible to resolve. How to differentiate a coding transcript from a non-coding one if both overlap an exon? I guess there are some imperfect attempts. But this pipeline is more than enough for a quick start.
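One such imperfect attempt is to look at the longest open reading frame of a transcript and treat anything with only short ORFs as a non-coding candidate. The sketch below is not taken from this thread; the 300 nt threshold is an assumed, arbitrary cutoff, only the given strand is scanned, and `longest_orf` and `looks_noncoding` are hypothetical helper names.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str) -> int:
    """Length in nt of the longest ATG..stop ORF on the given strand (stop codon included)."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a potential ORF
            elif start is not None and codon in STOPS:
                best = max(best, i + 3 - start)
                start = None
    return best


def looks_noncoding(seq: str, max_orf_nt: int = 300) -> bool:
    """Heuristic only: True if no ORF reaches max_orf_nt nucleotides."""
    return longest_orf(seq) < max_orf_nt


short = "CCATGAAATGA"
long_orf = "ATG" + "GCT" * 120 + "TAA"
print(longest_orf(short), looks_noncoding(short))
print(longest_orf(long_orf), looks_noncoding(long_orf))
```

A check like this misclassifies short real proteins and long chance ORFs, so it can only rank candidates, not settle the question raised above.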

13.2 years ago
Eric Fournier ★ 1.4k

As I've noted in "location and expression of lincRNA", you should go ahead and read Cabili et al., 2011, "Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses", for an in-depth example of how to find and annotate lncRNAs from RNA-seq data.


Good, finally a paper on this. I doubt, however, that the transcriptome assembly step is really necessary.

13.3 years ago
Ido Tamir 5.2k

You have a lot to do:

  1. align the reads to the genome
  2. assemble the reads into transcripts
  3. compare the transcripts with known protein-coding or non-coding annotations
  4. apply more advanced identification methods (not really answered there).

There are many ways to do all this, but as a starter you could follow the Scripture guide or the TopHat + Cufflinks tutorial for points 1-2.

Then you end up with some chromosomal regions showing transcriptional activity that you could compare to the known annotations (e.g. overlap) with bedtools.

Once you have done all this, I would ask again about point 4. Of course, there are other ways to do this (e.g. reference-free assembly of the RNA...).


The sequence is already assembled and I have all the data in BAM files. So it's really only steps 3 and 4 I'm interested in. I would be grateful if you could expound on them somewhat.


What is stored in the BAM files? At the end of step 2 you should end up with a transcript model that gives you locations for transcripts, not locations for reads. Most of the time these are BED, GTF or similar files. Did you convert these to BAM files? If you did indeed do an assembly with Cufflinks, you could try comparing it to a known annotation with cuffcompare. If you have a BED file with locations, try bedtools to overlap it with the known annotation. You could also extract the sequence of your assembled exons and BLAST it against a transcript database (RefSeq, known-genes).
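As a rough illustration of the cuffcompare route, the sketch below reads GTF lines and keeps transcripts whose `class_code` suggests no same-strand exonic overlap with the reference. It assumes the cuffcompare combined GTF carries `transcript_id` and `class_code` attributes on exon lines, and it treats codes `u` (intergenic), `i` (within an intron) and `x` (exonic overlap on the opposite strand) as "novel"; that selection, the 200 nt minimum length and the function name `novel_transcripts` are assumptions made for this example.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

ATTR_RE = re.compile(r'(\w+) "([^"]*)"')
NOVEL_CODES = {"u", "i", "x"}  # assumed selection of cuffcompare class codes


def novel_transcripts(gtf_lines: Iterable[str], min_length: int = 200) -> List[str]:
    """Return IDs of transcripts with a 'novel' class code and enough exonic length."""
    exon_len: Dict[str, int] = defaultdict(int)
    code: Dict[str, str] = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        tid = attrs.get("transcript_id")
        if tid is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        exon_len[tid] += int(fields[4]) - int(fields[3]) + 1
        if "class_code" in attrs:
            code[tid] = attrs["class_code"]
    return sorted(t for t, n in exon_len.items()
                  if n >= min_length and code.get(t) in NOVEL_CODES)


def _exon(tid: str, start: int, end: int, cc: str) -> str:
    # Build one made-up GTF exon line for the demonstration below.
    attrs = f'gene_id "XLOC_1"; transcript_id "{tid}"; class_code "{cc}";'
    return "\t".join(["chr1", "Cufflinks", "exon", str(start), str(end), ".", "+", ".", attrs])


example = [
    _exon("TCONS_1", 100, 250, "u"), _exon("TCONS_1", 400, 500, "u"),
    _exon("TCONS_2", 1000, 1100, "u"),   # too short
    _exon("TCONS_3", 2000, 2500, "="),   # matches a known transcript
    _exon("TCONS_4", 3000, 3300, "x"),
]
print(novel_transcripts(example))
```

On real data the lines would come from the GTF written by cuffcompare, e.g. `novel_transcripts(open(path))`, where `path` is wherever that file was written.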


I did an assembly with cufflinks.


So how was the cuffcompare comparison? Did it give you useful output? I don't think you need a command-line example for point 3 if you have come that far. For point 4, just follow the links in the answers that I linked to.

10.1 years ago

You could use Annocript. It annotates transcriptomes and reports putative long non-coding transcripts. You only need to give it the complete transcriptome as a FASTA file. If you have raw reads, you need to assemble them first with software such as Trinity.
