Question

Trinity assembly validation and statistics

2

Entering edit mode

9.7 years ago

giorgiocasaburi ▴ 90

Hi All,

After one week Trinity finally completed an assembly starting with 800 million reads (an entire Next Seq 500 run). The statistics are weird, although there were tons of sequences, but I would like your opinion:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':    858807
Total trinity transcripts:    924905
Percent GC: 40.20
########################################
Stats based on ALL transcript contigs:
########################################
    Contig N10: 1739
    Contig N20: 769
    Contig N30: 490
    Contig N40: 382
    Contig N50: 324
    Median contig length: 270
    Average contig: 363.98
    Total assembled bases: 336649200
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
    Contig N10: 1123
    Contig N20: 575
    Contig N30: 421
    Contig N40: 349
    Contig N50: 304
    Median contig length: 268
    Average contig: 341.45
    Total assembled bases: 293239534

I used Trinity with default parameters and using --trimmomatic plus --min_kmer_cov 2. I really was expecting the N50 to be bigger. What can be the reason for that?

Note: Before starting the assembly I quality filtered the sequences and merged the results in two big paired end fasta files.

Please any advice can be precious!

Thanks!

Giorgio

n50 transcriptome assembly trinity • 6.0k views

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by giorgiocasaburi ▴ 90

1

Entering edit mode

That's a lot of transcripts. Try using the read normalization parameter. What species are you assembling?

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Thanks for your answer. I know it is a lot and I did not run the digital normalization, maybe I should have considering that I have close to a billion raw reads. Do you think that might be the reason of such low N50? Anyway the species is an Hawaiian Squid (no genome annotation at all since it has tons of STR), predicted genome of about 3.8 GB.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by giorgiocasaburi ▴ 90

Ram · Answer 1 · 2015-04-16

1

Entering edit mode

9.7 years ago

seta ★ 1.9k

Hi, it sounds that your assembly is so fragmented that may resulted from using -min_kmer_cov 2. Try --normalize_reads and keep -min_kmer_cov as default. It took long time to finish, could you please tell us what's you system (RAM and CPU)?

ADD COMMENT • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by seta ★ 1.9k

0

Entering edit mode

Thanks seta! I have access to an HPC at my University. For that above analysis I ran with 20 cores with 19GB of ram each. It took a week total. Also, I forgot to mention that the analysis was ran on a unique file that I merged from several samples processed in the same run. Total of 23 libraries (4 fastq files each line paired end; so total of 8 files actually) and an overall total of 184 fastq files (~1 Gb each).

A) Could the fragmentation in your experience be due by the merging of the sample?

B) Should I run a distinct assembly for every single library?

C) Or should I just I considered one total file merged and run the normalization with -min_kmer_cov 1 (default) like you said?

Note: The libraries are from the same tissue but total different treatment if that matters at all.

Thanks!

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by giorgiocasaburi ▴ 90

0

Entering edit mode

If your end goal is to look at the DEGs from your different treatments then you want a single assembly with all samples done at the same time. Individual assemblies will be difficult to analyse. When you say 'unique file' did you merge all your paired end reads into one file? If so, that's why it took so long. I suggest merging all your forward reads into one file, and all your reverse reads into a second. Use params -left and -right to designate read orientation. Also, please reply in the comment sections, opposed to make an answer to your question.

ADD REPLY • link 9.7 years ago by st.ph.n ★ 2.7k

0

Entering edit mode

Sorry I mean unique file as one merged with the R1 reads and another merged with the R2 reads. I did use left and right obviously. The goal is indeed compare the treatment so I would stuck with the pooling approach. I am currently running the assembly again using normalization and -min_kmer_cov 1. I actually didn't even use the merged files but rather pass everything to Trinity --left and --right comma separated (e.g. --left library_1_R1.fastq, library_2_R1.fastq, library_N_R1.fastq --right --left library_1_R2.fastq, library_2_R2.fastq, library_N_R2.fastq) although I don't think it would make any difference as Trinity will merge everything anyway.

ADD REPLY • link updated 2.5 years ago by Ram 44k • written 9.7 years ago by giorgiocasaburi ▴ 90

Ram · Answer 2 · 2015-04-16

0

Entering edit mode

9.7 years ago

seta ★ 1.9k

As Stehan said, if your goal is to assess DEG, single assembly is better, but you need high quality assembly to do successful downstream analysis. more read may cause high fragmentation in your assembly, so you can try assembly in individual libraries and then merge them using some tool like, TGICL or EvidentialGene tool. Also, you can try multiple-kmer assembler like soap-denovo or CLC, work with higher k-mer to get rid of lots of transcripts. your case is interesting, please come back to the post whenever you get desired results using any pipeline.

ADD COMMENT • link 9.7 years ago by seta ★ 1.9k

0

Entering edit mode

Hello! I'm not sure if this is a comment or not, but definitely not an answer... I have a few questions...

I have recently completed a trinity assembly with about the same amount of data as the initial post, but for quail. Shy of 1.4 billion reads on a NextSeq 500. I did run the default --min_kmer_cov, --normailize_reads. Similarly, all R1 and R2 were each merged into a single files, and passed using --left and --right. I did modify the --trimomatic settings to be a bit more stringent, thus we filtered out 15% of the reads, and assembled the transcriptome with remaining 85% of the data (normalised). The assembly took roughly the same time, 10 d, with about 20-30 failed commands that were re-ran, until the assembly was successful. The intention is quite the same, run a differential expression analysis between treated and controls used in the assembly.

My concern is the number of "genes" that is being reported 685.214 (see statistics below). Were you able to run any of the multiple-kmers assemblers suggested here?

We obtained fantastic overall mapping rates using TopHat2, about 88-95% of the reads from the individual libraries from the sampels; the one problem we're observing is that we have have 60-65% of the reads mapping to multiple locations (see alignment summary further below). I'm assuming it's due to the number of isoforms per gene. Has anyone here been able to merge/combine of the isoforms into a single gene model as to avoid the redundancy ? If so, could you share what you've done? It would be really nice to have a tool that will construct and merge the sequences and provide you a gtf to run with TopHat.

Thank you in advance,
Rodrigo

Our assembly statistics are:

################################
## Counts of transcripts, etc.
################################

Total trinity 'genes':    685214
Total trinity transcripts:    924982
Percent GC: 42.37

########################################
Stats based on ALL transcript contigs:
########################################

    Contig N10: 5546
    Contig N20: 4009
    Contig N30: 3096
    Contig N40: 2421
    Contig N50: 1876

    Median contig length: 502
    Average contig: 1012.59
    Total assembled bases: 936624846

#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

    Contig N10: 5128
    Contig N20: 3537
    Contig N30: 2588
    Contig N40: 1904
    Contig N50: 1386

    Median contig length: 435
    Average contig: 829.66
    Total assembled bases: 568491645

==> TopHat/.../align_summary.txt <==

Left reads:
          Input     :  32813854
           Mapped   :  30971280 (94.4% of input)
            of these:  19899801 (64.3%) have multiple alignments (131621 have >20)
Right reads:
          Input     :  32813854
           Mapped   :  30304829 (92.4% of input)
            of these:  19594664 (64.7%) have multiple alignments (131618 have >20)
93.4% overall read mapping rate.

ADD REPLY • link updated 5.1 years ago by Ram 44k • written 9.4 years ago by rodrigo.gularte • 0