Question

Transcriptome assembly: Low GC, short contigs, low read alignment

7.7 years ago

jmah ▴ 30

Hi!

I am trying to troubleshoot two de novo Trinity assemblies. The samples were sequenced during the same run for two species of sponge, and I obtained 2x150 bp reads to a depth of 124x. We already have a whole transcriptome assembled for each species, but for our purposes I would like a de novo assembly. The GC content of my new assemblies is 3-7% lower than that of our old assemblies. Furthermore, my assemblies have many short contigs (N50: 800 bp vs. 1800 bp for the old assemblies; median length: 300 vs. 800 bp; mean length: 600 vs. 1200 bp). The nail in the coffin is that few reads align in proper paired orientation when mapped back to my de novo assemblies: only ~50% are in proper pairs.
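For anyone wanting to reproduce the length statistics quoted above on their own contigs, here is a minimal sketch in plain Python. It assumes an uncompressed FASTA file; the function names (`read_fasta_lengths`, `n50`, `summarize`) are made up for illustration and are not part of Trinity or any other tool, whose own stats scripts may define things slightly differently.

```python
from statistics import mean, median


def read_fasta_lengths(lines):
    """Yield the length of each sequence from an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Contig length at which the running sum of lengths, taken from
    longest to shortest, first reaches half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(lengths):
    """Return contig count, N50, median and mean length as a dict."""
    lengths = list(lengths)
    return {
        "n": len(lengths),
        "n50": n50(lengths),
        "median": median(lengths),
        "mean": mean(lengths),
    }
```

Usage would look something like `summarize(read_fasta_lengths(open("contigs.fasta")))`, where `contigs.fasta` is a placeholder path. Running it on both the old and new assemblies with the same script avoids comparing numbers produced by tools that use different definitions.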

I am most worried about the GC content. The GC content of the reads is similar to that of our old transcriptomes; it is only lower after assembly. I have changed the adapter-trimming parameters and tried Trinity's jaccard clip setting, but my assembly stats remain almost identical from run to run.
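One thing worth ruling out is how the GC figure itself is computed. The sketch below (plain Python, hypothetical function names, assuming an uncompressed FASTA file) reports both the pooled GC over all bases and the unweighted mean of per-contig GC. With many short contigs the two can differ noticeably, so the comparison is only fair if the same definition is applied to the reads, the old assemblies and the new ones.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_counts(seq):
    """Return (number of G/C bases, number of A/C/G/T bases); N is ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at


def gc_summary(records):
    """Return (pooled GC fraction, unweighted mean of per-contig GC fractions)."""
    total_gc = total_acgt = 0
    per_contig = []
    for _name, seq in records:
        gc, acgt = gc_counts(seq)
        total_gc += gc
        total_acgt += acgt
        if acgt:
            per_contig.append(gc / acgt)
    pooled = total_gc / total_acgt if total_acgt else 0.0
    mean_per_contig = sum(per_contig) / len(per_contig) if per_contig else 0.0
    return pooled, mean_per_contig
```

For example, `gc_summary(parse_fasta(open("contigs.fasta")))` (placeholder path) returns both numbers as fractions between 0 and 1.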

Has anyone received assemblies with low GC and short contigs before? If so, what did you do to fix that?

Thanks! If there's any more information that can prove helpful, please let me know.

RNA-Seq GC content troubleshooting • 2.2k views

ADD COMMENT • link 7.7 years ago by jmah ▴ 30

If you're using Trinity with that much depth, you might want to use the in silico read normalization parameter. Also, why assemble de novo instead of reference-based if you already have other assemblies? If you're looking for DEGs, combine all available data to create a single assembly, then align your samples back to that assembly to get abundance estimates.

ADD REPLY • link 7.7 years ago by st.ph.n ★ 2.7k

score 0 · Answer 1 · 2017-03-16

7.7 years ago

jmah ▴ 30

Hi st.ph.n,

Thanks for your advice! I did use normalization, and yes, my goal is to create a reference assembly for DE. I would rather not use a reference, because the goal is to find new genes and uncharacterized sequences. Any ideas about the low GC content?

Thanks!

ADD COMMENT • link 7.7 years ago by jmah ▴ 30