Hi,
I have performed de novo assemblies on mRNA using Trinity with paired-end Illumina data (125 bp PE). Trinity gives a large number of contigs (around 250,000) but many of these are very short (200-500 bp). In addition there can be many versions of the same contig - either different isoforms, splice variants or variant assemblies. Is there a way to "rationalise" or collapse the assembly? For example, only taking the longest isoform of each contig? I know that would possibly mean throwing away any info on splice variants, but that is not something I'm too interested in at the moment. And/or applying a size cut-off?
Assembly stats are as follows:
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 185527
Total trinity transcripts: 252342
Percent GC: 40.60
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 4739
Contig N20: 3372
Contig N30: 2579
Contig N40: 1972
Contig N50: 1452
Median contig length: 406
Average contig: 803.80
Total assembled bases: 202831907
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 4338
Contig N20: 2823
Contig N30: 1957
Contig N40: 1258
Contig N50: 802
Median contig length: 357
Average contig: 621.99
Total assembled bases: 115396193
Thanks,
Dave
Hi,
I know the longest might not be the "best" but I'm not sure what other criteria to use. I imagine that reads per contig, normalised for contig length, might be useful but I have no idea how to implement that. I've visualised my assemblies in Tablet and there is a wide range of reads per contig.
I've not used RSEM - how would this work without a reference genome?
I usually use
--min_contig_length 300
to get rid of many of the short contigs; you can just add it to your Trinity command. As for RSEM, no reference genome is needed: the align_and_estimate_abundance.pl
script within the Trinity package aligns your reads back to the assembled transcripts and runs RSEM on those alignments. Then, using the RSEM output, you can filter out contigs with FPKM less than 1. You can take a look at http://trinityrnaseq.sourceforge.net/analysis/abundance_estimation.html
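If it helps, here is a minimal sketch in Python of the filtering step described above. It is only an illustration, not part of Trinity or RSEM: the function names are made up, and it assumes the RSEM isoform-level table is tab-separated with transcript_id, gene_id and FPKM columns and that the first word of each FASTA header is the transcript ID. The optional longest_only switch keeps one (the longest surviving) isoform per gene, which is one possible way to collapse the assembly.

```python
import csv


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_expression(rsem_lines):
    """Map transcript_id -> (gene_id, FPKM) from an RSEM isoform table.

    Assumes a tab-separated table with a header row containing
    transcript_id, gene_id and FPKM columns.
    """
    reader = csv.DictReader(rsem_lines, delimiter="\t")
    return {
        row["transcript_id"]: (row["gene_id"], float(row["FPKM"]))
        for row in reader
    }


def filter_transcripts(fasta_lines, rsem_lines, min_fpkm=1.0, min_len=300,
                       longest_only=False):
    """Return (transcript_id, sequence) pairs that pass both cut-offs.

    With longest_only=True, only the longest passing isoform of each
    gene is kept.
    """
    expr = load_expression(rsem_lines)
    kept = {}
    for header, seq in read_fasta(fasta_lines):
        tid = header.split()[0]  # assumption: ID is the first word
        if tid not in expr:
            continue  # no abundance estimate for this contig
        gene, fpkm = expr[tid]
        if fpkm < min_fpkm or len(seq) < min_len:
            continue
        key = gene if longest_only else tid
        if key in kept and len(kept[key][1]) >= len(seq):
            continue  # an equally long or longer isoform is already kept
        kept[key] = (tid, seq)
    return list(kept.values())
```

To use it, open your assembly FASTA and the RSEM isoform results file and pass both file handles to filter_transcripts, then write the returned pairs back out as FASTA. The thresholds (FPKM 1, 300 bp) are just the values mentioned in this thread; adjust them to your data.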