Any tips on what to do downstream of a novel transcriptome assembly? I have a few things going, based mainly on previous work by someone else in my group:
basic stats about the contigs, length, etc.
check for redundancy in contigs
tabulate read counts per contig / create IGV tracks
examine coding potential of contigs
BLAST against nr (or some subset?) to annotate contigs
I wanted to see if there are any additional things I should be doing. I assembled using Trinity; the data are paired-end Illumina sequences. I have one sample at this point.
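For the "basic stats about the contigs" item, here is a minimal Python sketch of per-contig lengths and N50. It assumes a plain FASTA file of contigs; the function names `read_fasta_lengths` and `n50` are made up for illustration and are not part of Trinity or any other tool.

```python
def read_fasta_lengths(lines):
    """Return {contig_name: length} from an iterable of FASTA lines."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


def summarize(lengths):
    """Basic summary: contig count, total bases, min/max/mean length, N50."""
    values = list(lengths.values())
    if not values:
        return {"contigs": 0, "total_bases": 0, "n50": 0}
    return {
        "contigs": len(values),
        "total_bases": sum(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "n50": n50(values),
    }
```

With a real assembly you would pass an open file handle, for example `summarize(read_fasta_lengths(open(path)))`, where `path` points at your contig FASTA.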
You may want to know the percentage of reads that are singletons or that otherwise do not make it into the assembly. I find less value in coding potential than in assessing whether each contig looks full-length or not. For this, you could have 3 classes: definitely full-length, definitely not full-length, and undetermined. If coding potential is really something you like, look into upstream ORFs (they likely curtail translation or slow it down).
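One way the three classes could be sketched in Python is below. The rules are an assumption chosen for illustration, not an established method: take the longest stop-free stretch of codons in the three forward frames, then call the contig "full-length" if that stretch contains an ATG and is bounded by in-frame stop codons on both sides, "not full-length" if it lacks an ATG or runs off the 3' end without a stop, and "undetermined" if it has an ATG and a 3' stop but no in-frame stop upstream (so the ATG could be an internal methionine). The function names are hypothetical, and the reverse strand is ignored to keep it short.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_open_segment(seq):
    """Longest stop-free codon stretch over the three forward frames.

    Returns (length, frame, start, end, has_5p_stop, has_3p_stop), where
    seq[start:end] is the stretch and the two flags say whether an in-frame
    stop codon bounds it upstream and downstream.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        seg_start = frame
        upstream_stop = False
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOPS:
                cand = (pos - seg_start, frame, seg_start, pos, upstream_stop, True)
                if best is None or cand[0] > best[0]:
                    best = cand
                seg_start = pos + 3
                upstream_stop = True
            pos += 3
        # Trailing stretch that reaches the 3' end without hitting a stop.
        cand = (pos - seg_start, frame, seg_start, pos, upstream_stop, False)
        if best is None or cand[0] > best[0]:
            best = cand
    return best


def classify_contig(seq):
    """Assign one of the three classes using the assumed rules above."""
    _, _, start, end, stop5, stop3 = longest_open_segment(seq)
    segment = seq.upper()[start:end]
    codons = [segment[i:i + 3] for i in range(0, len(segment), 3)]
    has_atg = "ATG" in codons
    if not stop3 or not has_atg:
        return "not full-length"
    if stop5:
        return "full-length"
    return "undetermined"
```

Tabulating the three labels over all contigs gives the breakdown; a dedicated ORF predictor or alignment to reference proteins would be a more reliable basis than this heuristic.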
Sequence similarity searches for annotation could be run against the genome(s) of the closest species; a set of reference mRNAs or proteins will also do. Rather than searching across all of nr, if you cannot narrow it down to a few close relatives, I'd stick with a subset that focuses on a kingdom or family of interest, such as plants or insects, as the case may be.
Thanks, those are some good points.
You're welcome. With regard to full-length vs not, it might be interesting to look at differences in this for multi- vs single-exon genes, but only if you're real curious and like to show lots of data.