Hi,
I am new to the topic of genome annotation and would like to get some advice regarding my planned strategy.
So, I am exploring genetic diversity in maize, and for this purpose I have de novo assembled the genome of a variant which is supposed to be quite divergent from the reference. I'd now like to annotate the assembly. The strategy I have thought about is as follows:
Since maize has quite a lot of genomic resources, I don't see a reason to go for ab initio annotation as a first attempt. Rather, I would like to base my annotation on the existing annotation of the maize reference (B73) and a collection of transcripts which I have gathered from previous publications. Unfortunately, I do not have RNA-Seq data from the individual I am annotating. I combined multiple transcript sets with the official annotation and aligned them to my assembled sequence. However, the result seems rather noisy, with ~150k predicted genes, which sounds like too many.
I am wondering what my next step should be. Maybe I should filter my transcript set to reduce noise? I have already tried filtering out very short transcripts, but can I do something more sophisticated? Another option I have thought of is putting the transcripts through some clustering algorithm (e.g. OrthoMCL), then taking representative transcripts from each cluster and maybe removing singletons.
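To make the filtering idea a bit more concrete, here is a rough sketch of what I mean by something more sophisticated than a length cutoff: keep a transcript only if its best alignment covers most of the transcript at high identity. The Alignment fields, the function names and the thresholds (0.9 coverage, 0.95 identity) are all assumptions I made up for illustration; they do not come from any particular aligner and would need tuning.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # Hypothetical record; fill these in from whatever your aligner reports.
    transcript_id: str
    transcript_length: int  # full length of the transcript, in bases
    aligned_bases: int      # transcript bases included in the alignment
    matching_bases: int     # aligned bases identical to the genome


def coverage(a: Alignment) -> float:
    """Fraction of the transcript that is included in the alignment."""
    return a.aligned_bases / a.transcript_length


def identity(a: Alignment) -> float:
    """Fraction of aligned bases that match the genome."""
    return a.matching_bases / a.aligned_bases if a.aligned_bases else 0.0


def filter_alignments(alignments, min_coverage=0.9, min_identity=0.95):
    """Keep, per transcript, the passing alignment with the most matches.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    best = {}
    for a in alignments:
        if coverage(a) < min_coverage or identity(a) < min_identity:
            continue  # partial or divergent hit, treated here as noise
        current = best.get(a.transcript_id)
        if current is None or a.matching_bases > current.matching_bases:
            best[a.transcript_id] = a
    return best
```

Keeping a single best hit per transcript is a deliberate simplification here; it would also collapse real paralogous hits, so it may be too strict for a genome with as many duplications as maize.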
Is there a common way to assess the integrity of a specific gene annotation? Maybe this could help me remove pseudogenes and/or random alignments from my annotation results?
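By integrity I mean something like the following basic checks on the coding sequence of each predicted gene. This is only a minimal sketch assuming the CDS has already been extracted and spliced into one string on the coding strand; the function name cds_problems is my own, and passing these checks would of course not prove that a gene model is real.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_problems(cds: str) -> list:
    """Return a list of basic structural problems found in a spliced CDS."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("no start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature stop codon")
    return problems
```

A model with a premature stop codon or a frame-breaking length could then be flagged as a pseudogene candidate rather than dropped outright.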
Has anyone here done something like this before? I would appreciate any thoughts or advice on the strategy I described here.
Thank you!
You can take a look at RATT which is now part of PAGIT.
Thanks, I wasn't familiar with this software. However, since I'm also interested in detecting new genes not present in the reference annotation, I'd like to use the transcriptomic data on top of that, and this is where most of the noise comes from.