Question

de novo transcriptome assembly

1

8.2 years ago

402374688 ▴ 30

I'm assembling a transcriptome de novo. I have RNA-seq data from treatment and control groups at two time points, with three replicates per group. For the assembly, should I pool all reads (from both control and treatment groups), or assemble each replicate separately? Is it OK to pool them? And if I assemble each replicate separately, what should I do to make the assemblies comparable between groups and time points? Thank you.

Assembly RNA-Seq • 3.9k views

ADD COMMENT • link updated 8.2 years ago by Chris Fields ★ 2.2k • written 8.2 years ago by 402374688 ▴ 30

0

Treat each replicate separately. Differential expression requires replicates. Never pool; it's cringe-worthy.

ADD REPLY • link 8.2 years ago by Biogeek ▴ 470

0

Yeah, I know the replicates should be kept separate for the differential expression analysis. But for the transcriptome assembly itself, treatment, control, and the different replicates should all contain similar genes or transcripts, right? So can I pool them for the assembly and then map each replicate back to it for differential expression?

ADD REPLY • link 8.2 years ago by 402374688 ▴ 30

1

Just to be clear, everything goes into one assembly; you should have only one. For counts, use the reads from each replicate individually, then build a count matrix using the Trinity pipeline.
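To illustrate what "form a matrix" means here, below is a minimal sketch that combines per-replicate counts into one transcripts-by-samples table. The function name, sample names, and counts are all hypothetical, and this is not the Trinity script itself, just the idea behind it: every replicate is quantified against the same single assembly, so the transcript IDs line up across columns.

```python
def build_count_matrix(per_sample_counts):
    """Combine per-replicate counts into one transcripts x samples matrix.

    per_sample_counts maps a sample name to a {transcript_id: count} dict.
    A transcript that is missing from a sample gets a count of 0.
    """
    samples = list(per_sample_counts)
    transcripts = sorted(
        {t for counts in per_sample_counts.values() for t in counts}
    )
    rows = [
        [per_sample_counts[s].get(t, 0) for s in samples] for t in transcripts
    ]
    return samples, transcripts, rows


# Hypothetical counts for two replicates quantified against ONE pooled assembly.
counts = {
    "control_rep1": {"TRINITY_DN1_c0_g1_i1": 12, "TRINITY_DN2_c0_g1_i1": 0},
    "treated_rep1": {"TRINITY_DN1_c0_g1_i1": 40},
}
samples, transcripts, rows = build_count_matrix(counts)

# Print a tab-separated table: one row per transcript, one column per replicate.
print("\t".join(["transcript"] + samples))
for transcript, row in zip(transcripts, rows):
    print("\t".join([transcript] + [str(c) for c in row]))
```

Each column stays a separate replicate, which is what the differential expression statistics need.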

ADD REPLY • link 8.2 years ago by Biogeek ▴ 470

0

Got it, I really appreciate it. I'm new to assemblies, so thanks for your patient explanation.

ADD REPLY • link 8.2 years ago by 402374688 ▴ 30

0

Not necessarily...

Your treatment and control will presumably differ when compared.

Within replicate groups, you may have one replicate that is an outlier. Once everything is pooled, how do you pick out the rotten egg in the basket?

You may also have one set of reads with contamination, etc.

For transcriptome assembly, concatenate your reads in order (keeping the same sample order for both forward and reverse reads). Remember that the transcriptome is an assembly of everything, so feeding in one concatenated left file and one concatenated right file (presuming you have read pairs) made up of all reads is fine. Check out the Trinity GitHub page for some help. It's great if you're new to assemblies, and it's very beginner friendly.

For abundance estimation, in RSEM for example, you provide each set of reads as an individual replicate. Pooling at this stage in RNA-Seq is a bit cringe-worthy and defeats the purpose: having an idea of variability is key. If you pool, you also limit your downstream analysis, because you are stuck with a single replicate. Statistics works off replicates.
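As a rough sketch of the pooled assembly step: instead of physically concatenating files, Trinity also accepts comma-separated lists for `--left` and `--right`, which makes it easier to keep both lists in the same sample order. The helper below only builds the command line and does not run anything. The file names, CPU count, and memory value are hypothetical placeholders.

```python
import shlex


def trinity_command(samples, cpu=8, max_memory="50G", output="trinity_out"):
    """Build a Trinity command that pools all samples into one assembly.

    samples is a list of (left_fastq, right_fastq) pairs. Both lists are
    derived from the same sequence of pairs, so their order always matches.
    """
    left = ",".join(left_fq for left_fq, _ in samples)
    right = ",".join(right_fq for _, right_fq in samples)
    return [
        "Trinity",
        "--seqType", "fq",
        "--left", left,
        "--right", right,
        "--CPU", str(cpu),
        "--max_memory", max_memory,
        "--output", output,
    ]


# Hypothetical file names: every replicate from every group goes in.
pairs = [
    ("control_rep1_R1.fq", "control_rep1_R2.fq"),
    ("control_rep2_R1.fq", "control_rep2_R2.fq"),
    ("treated_rep1_R1.fq", "treated_rep1_R2.fq"),
]
cmd = trinity_command(pairs)
print(shlex.join(cmd))  # inspect the command, then run it from a shell
```

The same per-replicate pairs are then reused one at a time for the abundance estimation step, so the replicates are never merged for counting.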

ADD REPLY • link 8.2 years ago by Biogeek ▴ 470

score 0 · Answer 1 · 2016-09-01

0

8.2 years ago

Chris Fields ★ 2.2k

The approach suggested by the Trinity developers is to combine the reads from all samples for the assembly step (possibly using digital normalization to speed up the assembly), then align each sample's reads back to the assembly for filtering and DE analysis (the latter using salmon, RSEM, or alternative tools). This is explicitly stated in the notes for this workflow.
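One way to keep track of which reads belong to which replicate is a tab-delimited samples file (condition, replicate name, left reads, right reads), the layout Trinity's `--samples_file` option expects. The sketch below generates such lines for the design in the question (two groups, two time points, three replicates). The directory, file naming scheme, and time point labels are assumptions for illustration only.

```python
def samples_file_lines(groups, timepoints, n_reps, read_dir="reads"):
    """Return tab-delimited lines: condition, replicate, left reads, right reads.

    The file naming scheme (<condition>_rep<N>_R1/R2.fastq.gz) is hypothetical;
    adapt it to however your sequencing files are actually named.
    """
    lines = []
    for group in groups:
        for timepoint in timepoints:
            condition = f"{group}_{timepoint}"
            for rep in range(1, n_reps + 1):
                name = f"{condition}_rep{rep}"
                left = f"{read_dir}/{name}_R1.fastq.gz"
                right = f"{read_dir}/{name}_R2.fastq.gz"
                lines.append("\t".join([condition, name, left, right]))
    return lines


# Two groups x two time points x three replicates = 12 samples.
lines = samples_file_lines(["control", "treated"], ["t1", "t2"], n_reps=3)
print("\n".join(lines))
```

All twelve samples feed the single assembly, while the replicate names in column two keep them distinct for quantification.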

In addition, I recommend following up the assembly with TransRate to assess assembly quality and filter low-quality assembly artifacts, and then annotating the transcriptome (I'm biased towards tools like Trinotate, though there are others, including commercial tools like BLAST2GO); this helps identify additional elements like rRNA that you can disregard. Since BLAST is a typical step in annotation, you can also use it to screen for contaminants, if that is a potential issue in your assembly.

ADD COMMENT • link 8.2 years ago by Chris Fields ★ 2.2k

0

Thank you, that's really useful. I'm a beginner in this area. When using tools, I can understand the general options that are used in most cases, but when it comes to options that have to be tuned to the data, such as the choice of k-mer length, I just get lost. It's frustrating to be puzzled by this kind of stuff.

ADD REPLY • link 8.2 years ago by 402374688 ▴ 30

0

I can relate. Learning is a long process, especially if you're new to bioinformatics. I've been doing it for 2 years now and still learn every day. Stick to Trinity and its fixed k-mer length of 25. SOAPdenovo-Trans, Velvet, etc. are more complex to use.
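If the k-mer idea itself is unclear, here is a toy sketch of what "k-mer length" refers to: each read is chopped into all overlapping substrings of length k, and the assembler works from those. This is only an illustration of the concept with made-up sequences, not how Trinity is implemented internally.

```python
from collections import Counter


def count_kmers(reads, k=25):
    """Count every overlapping substring of length k across all reads.

    A read of length L contributes L - k + 1 k-mers; reads shorter than k
    contribute none.
    """
    counts = Counter()
    for read in reads:
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    return counts


# Toy example with a small k so the k-mers are easy to read by eye.
toy = count_kmers(["ACGTACGT"], k=4)
for kmer, n in sorted(toy.items()):
    print(kmer, n)

# With the default k = 25, a 100 bp read yields 100 - 25 + 1 = 76 k-mers.
print(sum(count_kmers(["A" * 100]).values()))
```

A smaller k gives more overlap between reads but also more ambiguity in repeats, which is the trade-off behind the option.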

ADD REPLY • link 8.2 years ago by Biogeek ▴ 470

0

Yeah, it is indeed a long process, especially since my supervisor doesn't work in bioinformatics. That's obvious now that I've just started: it's easy to waste a whole day without knowing what to do. As a senior, can you give me some advice on how to get started in this area? (Basically, we are studying the genome of an insect and doing some transcriptome analyses.)

ADD REPLY • link 8.2 years ago by 402374688 ▴ 30

0

Mate,

I'm a third year PhD student and I had the same problem ;-) not quite a senior yet haha. I think the main things to do which will help are:

Focus and read up on your approaches. Find an approach and read papers on how other authors implement it. It is not a race; it pays off to do your research before running ahead and applying methods. Use the forums (here and SEQanswers are good forums, with plenty of helpful peeps to help/advise you).

OK, you have a genome. Do you know how complete it is? If it's quite good/complete, use it: I would do a reference-guided assembly, which is much easier. You could use the STAR aligner to build the genome index, align your reads to the genome to produce BAM files, and then do the counts using something like Cufflinks.

If it's not, go the de novo way. It's messier, but may yield a more complete analysis if your genome is fragmented and not very well annotated.
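The reference-guided route mentioned above could look roughly like the sketch below, which only builds the three command lines (index, align, Cufflinks) for one sample and prints them. The file names, thread count, and directory names are hypothetical, and the `--outSAMstrandField intronMotif` setting assumes unstranded libraries, which is what Cufflinks needs from STAR in that case.

```python
import shlex


def star_cufflinks_commands(genome_fasta, left, right, sample,
                            threads=8, index_dir="star_index"):
    """Return the commands for one sample: build index, align, run Cufflinks.

    The index only needs to be built once per genome, not once per sample.
    Older STAR versions expect index_dir to exist before indexing.
    """
    prefix = f"{sample}_"
    # STAR names its sorted BAM output <prefix>Aligned.sortedByCoord.out.bam.
    bam = f"{prefix}Aligned.sortedByCoord.out.bam"
    return [
        ["STAR", "--runMode", "genomeGenerate",
         "--genomeDir", index_dir,
         "--genomeFastaFiles", genome_fasta,
         "--runThreadN", str(threads)],
        ["STAR", "--genomeDir", index_dir,
         "--readFilesIn", left, right,
         "--runThreadN", str(threads),
         "--outSAMtype", "BAM", "SortedByCoordinate",
         "--outSAMstrandField", "intronMotif",  # assumes unstranded reads
         "--outFileNamePrefix", prefix],
        ["cufflinks", "-p", str(threads), "-o", f"{sample}_cufflinks", bam],
    ]


# Hypothetical inputs for a single replicate.
cmds = star_cufflinks_commands(
    "genome.fa", "control_rep1_R1.fq", "control_rep1_R2.fq", "control_rep1"
)
for cmd in cmds:
    print(shlex.join(cmd))
```

As with the de novo route, the alignment and counting steps are repeated for every replicate separately.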

ADD REPLY • link 8.2 years ago by Biogeek ▴ 470

0

Keep in mind that Trinity now also performs reference-guided assemblies

ADD REPLY • link 8.2 years ago by Chris Fields ★ 2.2k

0

I assembled several transcriptomes of the same organism de novo and found that as the number of reads (samples) increases, the resulting assembly gets larger and larger. To my knowledge, this means it contains redundant transcripts, right? What I want to ask is how to remove these redundant transcripts (if we can call them that). One more concern: when removing redundancy, is it possible to lose some genes of the same family, or to disturb the subsequent quantification within a family? As far as I know, the following steps may help: when assembling, use --normalize_reads to limit the maximum read coverage; after the Trinity assembly, use TGICL to extend the transcripts and CD-HIT to remove highly similar sequences. Are there other effective tools or strategies that can help with this?

ADD REPLY • link 8.2 years ago by 402374688 ▴ 30

1

Hi,

Yes, EvidentialGene's tr2aacds may be a good option for you. The tool is becoming increasingly popular. I used to use CD-HIT and CAP3, although I think clustering can remove important genes. CD-HIT and CAP3 do, however, seem to be used quite a lot. It's up to you which method you want to use.
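To make the trade-off concrete, here is a deliberately naive redundancy filter: it drops a transcript only when it is an exact substring (on either strand) of a longer one that was kept. This is not what CD-HIT, CAP3, or tr2aacds do; they cluster by similarity rather than exact containment. The sketch just shows why a strict filter keeps a family member that differs by a single base, while looser similarity thresholds could merge it. Sequences and IDs are made up, and upper-case input is assumed.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T, either case)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def drop_contained(transcripts):
    """Keep transcripts that are not exactly contained in a longer kept one.

    transcripts maps an ID to a sequence. Sequences are visited from longest
    to shortest, so a fragment is always compared against longer candidates.
    Quadratic in the number of transcripts: fine for a toy, not for a real
    assembly.
    """
    kept = {}
    ordered = sorted(transcripts.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, seq in ordered:
        rc = reverse_complement(seq)
        if any(seq in longer or rc in longer for longer in kept.values()):
            continue  # exact fragment of something already kept
        kept[name] = seq
    return kept


toy = {
    "t1": "ATGGCGTTTAAACCCGGG",  # full-length transcript
    "t2": "GCGTTTAAA",           # exact fragment of t1
    "t3": "GGGTTTAAA",           # reverse complement of a fragment of t1
    "t4": "GCGTTTCAA",           # differs from t1 by one base: kept
}
kept = drop_contained(toy)
print(sorted(kept))
```

Whatever tool is used, it is worth checking afterwards that known members of gene families of interest are still present.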

ADD REPLY • link 8.2 years ago by Biogeek ▴ 470

0

Yes, definite +1 for evidentialgenes/tr2aacds.

ADD REPLY • link 8.2 years ago by Chris Fields ★ 2.2k

0

I have downloaded EG and I've used the tr2aacds script. One question comes to mind: have I missed some configuration?

For a beginner, is the straightforward use just to run this script and take the .okay subset (.tr, .aa or .cds, depending on the downstream analysis)? I've read the .doc files and I can't find a "configuration process"; it looks almost too easy to use to be right.

ADD REPLY • link 6.8 years ago by pablo61991 ▴ 90