I assembled plant transcriptome 454 data (non-normalised) using Trinity after the following steps:
1) pre-processing (removal of adaptors and vector contamination)
2) removal of rRNA sequences
3) removal of reads matching chloroplast and mitochondrial genes using BWA (sketched below)
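To illustrate step 3, this is roughly what the organellar filtering amounts to: map the reads against chloroplast/mitochondrial references with BWA and carry forward only the reads that do not map. A minimal sketch using pysam (file names are hypothetical, and it assumes base qualities are present in the BAM):

    import pysam

    # Sketch of organellar filtering: keep only reads that did NOT map to the
    # chloroplast/mitochondrial references. File names are hypothetical.
    with pysam.AlignmentFile("reads_vs_organelles.bam", "rb") as bam, \
            open("filtered_reads.fastq", "w") as out:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # consider each read only once
            if read.is_unmapped:
                qual = pysam.qualities_to_qualitystring(read.query_qualities)
                out.write(f"@{read.query_name}\n{read.query_sequence}\n+\n{qual}\n")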
From 3,70,929 reads, I got 21,486 contigs. When I mapped the reads back to the contigs using BWA, only 44,678 reads aligned, i.e. appear to have been used in the assembly. What am I doing wrong here? I BLASTed a random selection of contigs and they share over 90% similarity with related legume proteins (although many of the hits were hypothetical proteins). However, only a small percentage of the contigs align to the transcript assemblies of related legumes when mapped using BWA.
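For reference, this is roughly how the mapped-read count can be double-checked, a minimal sketch assuming the BWA output was converted to BAM (the file name reads_vs_trinity_contigs.bam is hypothetical), counting each read once and ignoring secondary/supplementary alignments:

    import pysam

    # Count primary reads that mapped back to the Trinity contigs.
    # The BAM file name is hypothetical; any BAM made from the BWA SAM works.
    mapped = 0
    total = 0
    with pysam.AlignmentFile("reads_vs_trinity_contigs.bam", "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_secondary or read.is_supplementary:
                continue  # count each read only once
            total += 1
            if not read.is_unmapped:
                mapped += 1
    print(f"{mapped} of {total} reads mapped ({100.0 * mapped / total:.1f}%)")

If the total reported here is far below the number of input reads, the discrepancy may lie in how the SAM/BAM was generated or counted rather than in the assembly itself.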
The Velvet assembly of the same data produced 15,323 contigs, with lower N50, N90 and maximum contig length. The MIRA assembly produced more contigs and used more reads, but had lower N50, N90 and average contig length. Why are only 44,678 reads being used? Any advice is greatly appreciated.
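As a side note on the metrics being compared, N50/N90 can be computed directly from the contig lengths of each assembly; a minimal sketch (the toy lengths are made up):

    def nx(lengths, fraction):
        """Return the Nx value: the contig length at which `fraction` of the
        total assembly size is contained in contigs of that length or longer."""
        lengths = sorted(lengths, reverse=True)
        target = sum(lengths) * fraction
        running = 0
        for length in lengths:
            running += length
            if running >= target:
                return length
        return 0

    contig_lengths = [2100, 1800, 950, 600, 400, 300]  # toy example
    print("N50:", nx(contig_lengths, 0.5))
    print("N90:", nx(contig_lengths, 0.9))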
Do you mean 370k reads or 3 million? That would have a big impact on interpreting your read usage. Also, I agree with the comment above that Newbler would be a good tool of choice for your data.