Cleaning up Trinity assemblies
9.2 years ago
dpearton • 0

Hi,

I have performed de novo assemblies of mRNA using Trinity with paired-end Illumina data (125 bp PE). Trinity gives a large number of contigs (around 250,000), but many of these are very short (200-500 bp). In addition, there can be many versions of the same contig: different isoforms, splice variants or variant assemblies. Is there a way to "rationalise" or collapse the assembly, for example by keeping only the longest isoform of each Trinity 'gene'? I know that would possibly throw away information on splice variants, but that is not something I'm too interested in at the moment. Would a size cut-off also make sense?

Assembly stats are as follows:

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':  185527
Total trinity transcripts:      252342
Percent GC: 40.60

########################################
Stats based on ALL transcript contigs:
########################################

        Contig N10: 4739
        Contig N20: 3372
        Contig N30: 2579
        Contig N40: 1972
        Contig N50: 1452

        Median contig length: 406
        Average contig: 803.80
        Total assembled bases: 202831907


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

        Contig N10: 4338
        Contig N20: 2823
        Contig N30: 1957
        Contig N40: 1258
        Contig N50: 802

        Median contig length: 357
        Average contig: 621.99
        Total assembled bases: 115396193

Thanks,
Dave

RNA-Seq Assembly • 5.1k views
9.2 years ago
seta ★ 1.9k

If short contigs are not of interest, you can simply add the flag --min_contig_length 400 or 500, for example. Be careful, though: some proteins are short, so a strict cut-off may discard their transcripts. Although there is a script to get the longest isoform, the longest transcript is not always the best one, so you could also consider filtering out poorly supported transcripts using the RSEM output. Hope this helps.
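
For illustration only, here is a minimal Python sketch of the "longest isoform per gene plus size cut-off" idea. It is not the script shipped with Trinity. It assumes headers look like TRINITY_DN1000_c115_g5_i1 (newer releases) or comp123_c0_seq1 (older releases), so the gene ID is taken to be the header minus the trailing isoform suffix. The function names and the regular expression are hypothetical and should be checked against your own FASTA headers.

```python
import re

# Assumption: the isoform is encoded as a trailing "_i<N>" (newer Trinity)
# or "_seq<N>" (older Trinity); everything before it is treated as the gene ID.
ISOFORM_SUFFIX = re.compile(r"_(?:i|seq)\d+$")


def read_fasta(lines):
    """Yield (header_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first whitespace-delimited token of the header.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def longest_isoform_per_gene(records, min_length=0):
    """Keep the longest transcript of each gene, then apply a length cut-off."""
    best = {}
    for name, seq in records:
        gene = ISOFORM_SUFFIX.sub("", name)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (name, seq)
    return [(n, s) for n, s in best.values() if len(s) >= min_length]
```

To use it, open the Trinity FASTA file, pass the file handle to read_fasta, and write out whatever longest_isoform_per_gene(..., min_length=300) returns. Note that the cut-off is applied after collapsing, so a gene whose longest isoform is too short is dropped entirely.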


Hi,

I know the longest might not be the "best" but I'm not sure what other criteria to use. I imagine that reads/contig normalised for length might be useful, but I have no idea how to implement that. I've visualised my assemblies in Tablet and there is a wide range of reads/contig.
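
As a rough sketch of what "reads/contig normalised for length" could look like, the Python below computes reads per kilobase and an FPKM-style value from a read count, a contig length and the total number of mapped reads. The function names are hypothetical, and it uses the raw contig length; tools such as RSEM use an effective length and handle multi-mapping reads, so their numbers will differ.

```python
def reads_per_kilobase(read_count, contig_length):
    """Read count divided by contig length in kilobases."""
    return read_count / (contig_length / 1000.0)


def fpkm(read_count, contig_length, total_mapped_reads):
    """Reads per kilobase of contig per million mapped reads (fragments)."""
    return reads_per_kilobase(read_count, contig_length) / (total_mapped_reads / 1e6)


def rank_by_fpkm(counts, total_mapped_reads):
    """Sort contigs by FPKM, highest first.

    counts maps contig name -> (read_count, contig_length).
    """
    scored = {
        name: fpkm(n_reads, length, total_mapped_reads)
        for name, (n_reads, length) in counts.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)
```

The point of the normalisation is that a 4 kb contig with 200 reads and a 500 bp contig with 25 reads have the same support per base, even though their raw read counts differ eightfold.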

I've not used RSEM - how would this work without a reference genome?


I usually use --min_contig_length 300 to get rid of many short contigs; just add it to your Trinity command. As for RSEM, use the align_and_estimate_abundance.pl script in the Trinity package; it aligns your reads back to the assembled transcripts, so no reference genome is needed. Then, using the RSEM output, you can filter out contigs with FPKM less than 1. Take a look at http://trinityrnaseq.sourceforge.net/analysis/abundance_estimation.html
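
The FPKM filter could be sketched in Python roughly as follows. This assumes the RSEM per-transcript table is tab-separated with a header row containing transcript_id and FPKM columns, as in RSEM's isoforms.results file. The function name is hypothetical and the default threshold of 1 simply mirrors the suggestion above.

```python
import csv


def transcripts_passing_fpkm(lines, min_fpkm=1.0):
    """Return the set of transcript IDs whose FPKM is at least min_fpkm.

    lines is any iterable of text lines, e.g. an open file handle on an
    RSEM isoforms.results table (assumed tab-separated with a header row).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {
        row["transcript_id"]
        for row in reader
        if float(row["FPKM"]) >= min_fpkm
    }
```

The returned IDs can then be used to subset the Trinity FASTA file. Whether 1 is a sensible threshold depends on sequencing depth, so it is worth checking how many contigs survive at a few different values.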
