Question

RNA-seq de novo transcript assembly from one gene

0

9.0 years ago

Floris Brenk ★ 1.0k

Hi all,

I was wondering whether anyone has experience with de novo transcript assembly from RNA-seq data (100 bp PE Illumina reads) for only one gene. We have about 50 RNA-seq libraries of human tissue. At the moment we are interested in only one gene and want to know which transcripts of this gene are expressed. Are there specific packages/programs for this? Or does anyone have some tips or ideas about this?

RNA-Seq Assembly next-gen • 2.4k views

ADD COMMENT • link 9.0 years ago by Floris Brenk ★ 1.0k

score 2 · Answer 1 · 2016-05-12

2

9.0 years ago

Rohit ★ 1.5k

Is there any specific reason for opting for de novo assembly rather than reference alignment, given that you already know the gene?

Also, if you think de novo would be better, you can go for a reference-guided Trinity assembly using your gene of interest. If the isoforms are what interests you most, you can also do a complete de novo transcriptome assembly (no guide) and then check what the de novo transcripts for your gene look like.
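A minimal sketch of the reference-guided option, assuming you already have a coordinate-sorted BAM of reads aligned to the genome. `--genome_guided_bam`, `--genome_guided_max_intron`, `--max_memory` and `--CPU` are Trinity options; the file name, intron size and resource values below are placeholders, not recommendations. The function only builds the command line, so check it against your installed Trinity version before running it.

```python
import shlex


def genome_guided_trinity_cmd(sorted_bam, max_intron=10000,
                              max_memory="10G", cpu=4,
                              output="trinity_gg_out"):
    """Build (but do not run) a genome-guided Trinity command.

    sorted_bam : coordinate-sorted BAM of aligned RNA-seq reads (placeholder name)
    max_intron : assumed upper bound on intron length; tune for your gene
    output     : output directory; Trinity expects 'trinity' in the name
    """
    return [
        "Trinity",
        "--genome_guided_bam", sorted_bam,
        "--genome_guided_max_intron", str(max_intron),
        "--max_memory", max_memory,
        "--CPU", str(cpu),
        "--output", output,
    ]


# Hypothetical input file, for illustration only.
cmd = genome_guided_trinity_cmd("gene_region.sorted.bam")
print(shlex.join(cmd))
```

To actually launch it, pass the list to `subprocess.run(cmd, check=True)`.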

ADD COMMENT • link 9.0 years ago by Rohit ★ 1.5k

0

Thanks for your reply. The reason is that when I look up this gene (and many others) in the FANTOM5 transcription start site database, there are many more potential transcription start sites than are annotated. This suggests there might be more transcripts present than are annotated in e.g. GENCODE or UCSC. For Trinity, would you recommend pooling all samples together for the initial assembly, or running it per library?

ADD REPLY • link 9.0 years ago by Floris Brenk ★ 1.0k

0

Whether or not to pool samples for de novo transcriptomes is a question with no obvious answer. By pooling you will get better contiguity, but the chance of mis-assemblies increases. Without pooling, the low-expressed transcripts will not be missed, but it means more time and multiple runs. If you want a quick and dirty approach, why not pool the samples to start with, provided you have enough computing power? If you reach the memory limit, you might have to normalise the data.
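To make the two options concrete, here is a rough sketch that builds either one pooled Trinity command or one command per library. Trinity accepts comma-separated file lists for `--left` and `--right`, which is what the pooled version relies on. The FASTQ names, memory and CPU values are made up for illustration.

```python
def pooled_trinity_cmd(libraries, max_memory="50G", cpu=8,
                       output="trinity_pooled"):
    """Build one Trinity command that assembles all libraries together.

    libraries : list of (left_fastq, right_fastq) tuples, one per library
    """
    if not libraries:
        raise ValueError("need at least one (left, right) FASTQ pair")
    # Trinity takes comma-separated lists of files for paired-end input.
    left = ",".join(left_fq for left_fq, _ in libraries)
    right = ",".join(right_fq for _, right_fq in libraries)
    return [
        "Trinity", "--seqType", "fq",
        "--left", left, "--right", right,
        "--max_memory", max_memory, "--CPU", str(cpu),
        "--output", output,
    ]


def per_library_trinity_cmds(libraries, **kwargs):
    """Build one Trinity command per library (the non-pooled alternative)."""
    return [
        pooled_trinity_cmd([lib], output=f"trinity_lib{i}", **kwargs)
        for i, lib in enumerate(libraries, start=1)
    ]


# Hypothetical file names, for illustration only.
libs = [("s1_R1.fastq", "s1_R2.fastq"), ("s2_R1.fastq", "s2_R2.fastq")]
print(" ".join(pooled_trinity_cmd(libs)))
```

With around 50 libraries the per-library route means 50 separate runs, which is the time cost mentioned above.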

ADD REPLY • link 9.0 years ago by Rohit ★ 1.5k

0

To reduce the computational load, do you think I could also just start from the already aligned BAM files, extract the reads for the gene of interest, convert them to .fastq files and continue from there? Or would this be too biased an approach?

ADD REPLY • link 9.0 years ago by Floris Brenk ★ 1.0k

0

You could do that. There should be no bias, since you have aligned to the whole genome. If you suspect that there are additional transcription start sites, are you going to opt for a "region" of interest rather than the gene of interest (or would you just use the coordinates of the longest gene model)?
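A minimal sketch of that extraction with samtools, assuming a coordinate-sorted, indexed BAM. It builds three commands: `samtools view` to pull reads overlapping the region, `samtools sort -n` to group mates by name, and `samtools fastq` to write paired FASTQ files. The BAM name, region string and output prefix are placeholders. Note that mates aligned outside the region and unmapped reads are not captured this way, so padding the region generously is probably wise.

```python
def region_to_fastq_cmds(bam, region, prefix):
    """Build samtools commands to turn reads from one region into paired FASTQ.

    bam    : coordinate-sorted, indexed BAM (placeholder name)
    region : samtools region string such as 'chr1:1000-2000' (placeholder)
    prefix : prefix for intermediate and output files
    """
    region_bam = f"{prefix}.region.bam"
    name_sorted = f"{prefix}.namesorted.bam"
    return [
        # 1. Keep only alignments overlapping the region of interest.
        ["samtools", "view", "-b", "-o", region_bam, bam, region],
        # 2. Sort by read name so that mates sit next to each other.
        ["samtools", "sort", "-n", "-o", name_sorted, region_bam],
        # 3. Write R1/R2 FASTQ; discard singletons and unflagged reads.
        ["samtools", "fastq",
         "-1", f"{prefix}_R1.fastq", "-2", f"{prefix}_R2.fastq",
         "-0", "/dev/null", "-s", "/dev/null", "-n", name_sorted],
    ]


# Hypothetical inputs, for illustration only.
for cmd in region_to_fastq_cmds("sample.bam", "chr1:1000-2000", "sample"):
    print(" ".join(cmd))
```

Each command list can be passed to `subprocess.run(cmd, check=True)` in order; the resulting FASTQ pair can then go into the assembler.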

ADD REPLY • link 9.0 years ago by GenoMax 151k

0

Yes, I was indeed thinking more of a region of interest, based on some UCSC tracks and the MiTranscriptome data.

ADD REPLY • link 9.0 years ago by Floris Brenk ★ 1.0k

score 1 · Answer 2 · 2016-05-13

I have recently run a de novo assembly of a tree genome with Trinity, using 2x100 paired-end data from Illumina.

I did it to find the ortholog of a tomato gene of interest in the tree I am working on.

I ran Trinity with the program's regular (default) settings.

In the first assembly I got over 860,000 contigs in FASTA format, and after running a local BLAST with the tomato sequence as the query, I found several candidate contigs for my gene of interest. To my surprise, one of these contigs spanned more than 3,700 bases, encoding a protein of more than 1,100 amino acids with over 67% homology at the protein level. This makes me confident that Trinity is working much better than I was expecting.
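For anyone repeating the BLAST step above, here is a small sketch for picking candidate contigs out of tabular BLAST output. It assumes the search was run with `-outfmt 6` (the default 12-column tabular format), for example `tblastn` with a protein query against a `makeblastdb` database built from the Trinity FASTA; the post does not say which BLAST program was used, so that part is an assumption. The thresholds and the two example rows are invented purely to show the parsing.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str     # contig ID from the assembly
    pident: float    # percent identity
    length: int      # alignment length (residues for tblastn/blastp)
    evalue: float
    bitscore: float


def parse_outfmt6(lines: Iterable[str]) -> Iterator[Hit]:
    """Parse default BLAST -outfmt 6 rows (12 tab-separated columns)."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Hit(f[0], f[1], float(f[2]), int(f[3]),
                  float(f[10]), float(f[11]))


def best_candidates(lines: Iterable[str], min_pident: float = 60.0,
                    min_length: int = 300) -> List[Hit]:
    """Keep hits above both thresholds, best bit score first."""
    hits = [h for h in parse_outfmt6(lines)
            if h.pident >= min_pident and h.length >= min_length]
    return sorted(hits, key=lambda h: h.bitscore, reverse=True)


# Invented toy rows, not real results.
toy = [
    "tomato_prot\tcontig_A\t67.5\t1100\t300\t10\t1\t1100\t200\t3500\t0.0\t1500",
    "tomato_prot\tcontig_B\t45.0\t200\t100\t5\t1\t200\t1\t600\t1e-20\t150",
]
for hit in best_candidates(toy):
    print(hit.subject, hit.pident, hit.length)
```

In practice you would read the lines from the BLAST output file and adjust the cut-offs to how diverged the two species are.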

I ran Trinity on the Galaxy web service at Indiana University. They kindly provided me with free access to it, and I very much recommend it.