Question

RNA-Seq on an Unannotated Genome

-1

11.8 years ago

kanwarjag ★ 1.2k

I have RNA-seq data from an unannotated eukaryotic species. A reference genome is available, but it is not annotated. What is the best way to assign genes from the aligned RNA-seq data, or am I off track? I know we can map the reads to the unannotated genome; however, if I cannot tell which genes are up or down, then the RNA-seq has no value. Thanks

rna-seq genome • 9.4k views

ADD COMMENT • link updated 11.8 years ago by fridhackery ▴ 170 • written 11.8 years ago by kanwarjag ★ 1.2k

0

Can you add more information? Is your RNA-seq paired-end? One good check of whether the reference genome is any good is to take a sample of read pairs (e.g. 100) from the first and second FASTQ files and BLAST them against the genome. If the genome assembly is poor, you will see one read of a pair map to a different scaffold from its mate. With paired-end RNA-seq, you know that the two reads of a pair must be less than 1000 bases apart on the same chromosome (and typically much closer; that is a conservative threshold). If they aren't, then you will have to do de novo assembly.
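The check described here can be sketched in a few lines of Python. This is only an illustration, assuming you have already reduced the BLAST output to one best hit per read; the `Hit` type and the `fraction_consistent` function are hypothetical names, and the 1000-base threshold is the one mentioned above, exposed as a parameter.

```python
from typing import Iterable, NamedTuple, Tuple


class Hit(NamedTuple):
    """Best BLAST hit for one read (hypothetical, simplified record)."""
    scaffold: str
    position: int  # start coordinate of the hit on the scaffold


def fraction_consistent(pairs: Iterable[Tuple[Hit, Hit]],
                        max_distance: int = 1000) -> float:
    """Return the fraction of read pairs whose two mates hit the same
    scaffold within max_distance bases of each other."""
    total = 0
    consistent = 0
    for first, second in pairs:
        total += 1
        same_scaffold = first.scaffold == second.scaffold
        if same_scaffold and abs(first.position - second.position) <= max_distance:
            consistent += 1
    return consistent / total if total else 0.0


# Example with made-up hits: two of the four pairs are consistent.
example_pairs = [
    (Hit("scaffold_1", 100), Hit("scaffold_1", 350)),    # consistent
    (Hit("scaffold_1", 100), Hit("scaffold_2", 120)),    # different scaffolds
    (Hit("scaffold_3", 5000), Hit("scaffold_3", 9000)),  # too far apart
    (Hit("scaffold_4", 10), Hit("scaffold_4", 200)),     # consistent
]
print(fraction_consistent(example_pairs))
```

A low fraction on a sample of about 100 pairs would point towards a fragmented assembly, although mates that span a long intron can also exceed the threshold.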

ADD REPLY • link 11.8 years ago by dario.garvan ▴ 520

0

It is PE 50 bp. The reference genome sequence comes from shotgun sequencing. A few genes have been reported in the literature at specific coordinates, but most of the sequence does not have any gene names mapped.

ADD REPLY • link 11.8 years ago by kanwarjag ★ 1.2k

score 6 · Answer 1 · 2013-06-28

6

11.8 years ago

fridhackery ▴ 170

First map all your reads to the reference, resulting in a BAM file. Use bedtools genomecov to extract the positions with coverage greater than zero. These are the locations of all your transcribed sequences; basically your own annotation. Use fastaFromBed to pull those sequences out of the reference genome, and blastx them against your closest well-annotated species. This will give names for each protein-coding transcript.
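To make the "coverage greater than zero" step concrete, here is a minimal Python sketch of the idea. It is not bedtools itself: it assumes the per-base depth has already been loaded into a dictionary, and `covered_regions` is a hypothetical helper. Coordinates are 0-based and half-open, as in BED files.

```python
from typing import Dict, List, Tuple


def covered_regions(depth_by_scaffold: Dict[str, List[int]],
                    min_depth: int = 1) -> List[Tuple[str, int, int]]:
    """Merge consecutive positions with depth >= min_depth into
    (scaffold, start, end) intervals, 0-based and half-open."""
    regions = []
    for scaffold, depths in depth_by_scaffold.items():
        start = None  # start of the region currently being extended
        for i, depth in enumerate(depths):
            if depth >= min_depth:
                if start is None:
                    start = i
            elif start is not None:
                regions.append((scaffold, start, i))
                start = None
        if start is not None:  # region runs to the end of the scaffold
            regions.append((scaffold, start, len(depths)))
    return regions


# Made-up depths: scaffold_1 has two covered stretches, scaffold_2 none.
depths = {"scaffold_1": [0, 2, 3, 0, 0, 1], "scaffold_2": [0, 0]}
print(covered_regions(depths))
```

Note that with spliced alignments each exon would show up as its own region, so neighbouring regions may belong to the same transcript.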

Then use something like the readcoverage.pl script included in the PolyCat package: http://128.192.141.98/CottonFiber/pages/estlib/PolyCat.aspx and your home-made annotation to calculate coverage across your transcribed regions.
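As a rough illustration of what calculating coverage across the transcribed regions means, the sketch below averages per-base depth over each region. It is not the readcoverage.pl script and makes no claim about that script's output; `mean_coverage` and the input layout (a dictionary of per-base depths plus 0-based, half-open regions) are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (scaffold, start, end), 0-based half-open


def mean_coverage(depth_by_scaffold: Dict[str, List[int]],
                  regions: List[Region]) -> Dict[Region, float]:
    """Return the mean per-base depth for each region."""
    result = {}
    for scaffold, start, end in regions:
        window = depth_by_scaffold[scaffold][start:end]
        result[(scaffold, start, end)] = sum(window) / len(window) if window else 0.0
    return result


# Made-up depths and a home-made annotation with two regions.
sample_depths = {"scaffold_1": [0, 2, 4, 0, 0, 1]}
annotation = [("scaffold_1", 1, 3), ("scaffold_1", 5, 6)]
print(mean_coverage(sample_depths, annotation))
```

Running this once per sample gives a table of regions by samples that can then be compared between conditions.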

ADD COMMENT • link 11.8 years ago by fridhackery ▴ 170

0

That is exactly what I want to know!

ADD REPLY • link 11.8 years ago by kanwarjag ★ 1.2k

0

Gee, I thought I suggested the same thing (via the tuxedo suite). I'm glad you understand it now :)

ADD REPLY • link 11.8 years ago by seidel 11k

score 2 · Answer 2 · 2013-06-28

2

11.8 years ago

seidel 11k

"Un-annotated" is a somewhat ambiguous term in this context. If you mean to say that there are no known genes in your organism, then you will have to do de novo transcriptome assembly with your RNA Seq data to describe the transcriptome, and then you can go back and quantify your transcriptome from the RNA Seq data. You could do your assembly with or without the genome reference (i.e. reference guided vs. completely de novo). It's a lot of work either way, and there are caveats to each approach.

However, it could also be the case that you have FASTA sequences describing some genes in your organism, but these sequences are not annotated to the genome assembly. Since many RNA Seq pipelines depend on the genome and some annotation describing how genes fall on the genome, you may simply be confused about how to proceed given a genome but no GTF or GFF descriptions of genes. If you have gene sequences, you can use RNA Seq data to quantify them in the absence of the genome (and thus in the absence of any "annotation"). You would simply make an alignment index of your gene sequences, align reads, and then count alignments. There are several genomes for which one can find tens of thousands of gene sequences, yet there are no mappings to the "reference genome", which is usually a messy pile of contigs. Counting reads on the gene models is a quick and dirty way to do basic gene expression experiments and make progress in biology in the absence of a mature genome or any "annotation".
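The "count alignments" step can be sketched as below, assuming the reads were aligned to an index built from the gene sequences and the output is available as SAM text. `count_alignments` is a hypothetical helper; it relies only on standard SAM columns (FLAG in column 2, reference name in column 3) and skips unmapped, secondary, and supplementary records.

```python
from collections import Counter
from typing import Iterable

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def count_alignments(sam_lines: Iterable[str]) -> Counter:
    """Count primary mapped alignments per reference (gene) name."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        gene = fields[2]
        if flag & (UNMAPPED | SECONDARY | SUPPLEMENTARY):
            continue
        counts[gene] += 1
    return counts


# Made-up SAM records: two reads on geneA, one primary read on geneB.
sam_example = [
    "@SQ\tSN:geneA\tLN:1200",
    "r1\t0\tgeneA\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tgeneA\t40\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r4\t256\tgeneB\t5\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r5\t0\tgeneB\t7\t60\t4M\t*\t0\t0\tACGT\tIIII",
]
print(count_alignments(sam_example))
```

With paired-end data this counts each mate separately, so you may prefer to count one mate per pair before comparing samples.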

ADD COMMENT • link 11.8 years ago by seidel 11k

0

Thanks for the comments. To clarify: the genome has a few mapped genes reported in the literature, but other than that no genes are assigned across the genome. So what I was thinking is that, since most of the genome lacks assignment of specific genes, it may not be useful to construct a transcriptome with or without the reference, as I cannot find which genes are differentially expressed.

ADD REPLY • link 11.8 years ago by kanwarjag ★ 1.2k

1

Let's not put the cart before the horse. The purpose of RNA Seq for differential expression is to find which genes are differentially expressed. In your organism, there are no or few genes defined. So in your statements: "if I cannot tell which genes up or down then RNA seq has no value", and "since [...] genome is lacking assignment of specific genes it may not be useful to construct transcriptome", the implication is that you have no genes to study. But if you have RNA Seq data - you can define the genes, and then study them. You would assemble your RNA Seq reads into transcripts with or without the help of the reference (e.g. trinity vs. tuxedo suite). These transcripts are then THE GENES. You can then go back and evaluate which GENES are showing evidence of differential expression with your RNA Seq data.

ADD REPLY • link 11.8 years ago by seidel 11k

0

Sure I agree it is close to what fridhackery suggested. Thanks

ADD REPLY • link 11.8 years ago by kanwarjag ★ 1.2k

0

Isn't it possible to map the RNA-seq reads to the genome and then assemble and quantify them without any GFF/GTF file? Do you know any program that can do assembly without the need of GFF?

ADD REPLY • link 11.5 years ago by bioLife ▴ 50

0

Yes, this is a common procedure using TopHat (for mapping) and Cufflinks (for assembly and expression), which is what is meant by the "tuxedo suite". Given sequence reads and a genome, it will generate the transcripts (thus producing a GTF file) and quantify them.

ADD REPLY • link 11.5 years ago by seidel 11k