Hi,
I have a set of 2,228 whitefish (salmonid) genes assembled from an exon capture chip. Each sequence corresponds to a complete or partial gene. Since this is genomic DNA, I would like to find the exons within these sequences. So far, I have used a "blastx" approach against D. rerio's coding sequences, complemented with the "nr" database for the sequences with no hits. My Python script can identify many putative exons, but I have the feeling that it also misses a lot of them.
I know I can use an ORF-based approach, for example with EMBOSS (getorf or sixpack), but the problem is: how do I choose the ORFs that correspond to true exons among the many results I get when identifying ALL the ORFs in my sequences? If I use a length filter (e.g. ORFs > 300 bp), will I miss some small or partial exons?
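To make the length-filter concern concrete, here is a minimal sketch of a stop-to-stop ORF scan over all six frames in plain Python. It is not the EMBOSS implementation; the function names, the 0-based end-exclusive coordinates, and the toy sequence are assumptions made for illustration only.

```python
# Hypothetical helper names; a sketch, not a replacement for getorf/sixpack.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse-complement an upper-cased DNA string (non-ACGT symbols are kept)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_len=0):
    """Return (strand, frame, start, end, length) for each stop-free stretch.

    Coordinates are 0-based, end-exclusive, on the strand that was scanned.
    Stretches shorter than min_len (in bp) are discarded.
    """
    orfs = []
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            # Position just past the last complete codon in this frame.
            last = frame + ((len(s) - frame) // 3) * 3
            for i in range(frame, last, 3):
                if s[i:i + 3] in STOPS:
                    if i > start and i - start >= min_len:
                        orfs.append((strand, frame, start, i, i - start))
                    start = i + 3
            # Open stretch running to the end of the sequence (partial ORF).
            if last > start and last - start >= min_len:
                orfs.append((strand, frame, start, last, last - start))
    return orfs

# Toy example: a 150 bp stop-free stretch flanked by stop codons.
exon_like = "TAA" + "GCT" * 50 + "TAA"
print(len(find_orfs(exon_like)), len(find_orfs(exon_like, min_len=300)))
```

With the toy sequence, the 150 bp stretch is reported when no threshold is set and disappears with min_len=300, which is exactly the kind of small exon a 300 bp cutoff would throw away.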
I don't know if all of my gene sequences are of VERY good quality. Several of them are, but since they come from a de novo assembly of genomic DNA, and considering that there are a lot of repeated sequences in the whitefish genome, some might be crappy.
So, in brief, my major concern is how to know that the ORFs identified as exons do not lie partially or entirely in introns, and how to implement a filtering method that keeps only the good ones.
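One possible filter, sketched below, is to keep only ORFs that are largely covered by blastx hits. This assumes the blastx results are saved in the standard 12-column tabular format (-outfmt 6); the function names, the e-value cutoff of 1e-5, and the example lines are hypothetical. Strand is ignored and splice sites are not checked, so the intervals are only approximate exon boundaries.

```python
from collections import defaultdict

def parse_hsps(lines, max_evalue=1e-5):
    """Collect query intervals (1-based, inclusive) per query from outfmt 6 lines.

    Columns used: qseqid (0), qstart (6), qend (7), evalue (10).
    The max_evalue default is an arbitrary illustrative cutoff.
    """
    hsps = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        qstart, qend, evalue = int(fields[6]), int(fields[7]), float(fields[10])
        if evalue <= max_evalue:
            # blastx reports qstart > qend for minus-strand hits; normalise.
            hsps[fields[0]].append((min(qstart, qend), max(qstart, qend)))
    return hsps

def merge_intervals(intervals):
    """Merge overlapping or adjacent intervals into non-overlapping blocks."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def supported_fraction(orf, blocks):
    """Fraction of an ORF (start, end; 1-based, inclusive) covered by merged blocks."""
    start, end = orf
    covered = sum(max(0, min(end, e) - max(start, s) + 1) for s, e in blocks)
    return covered / (end - start + 1)

# Made-up example lines, not real results.
example = [
    "gene1\tprotA\t80.0\t50\t10\t0\t101\t250\t1\t50\t1e-20\t100",
    "gene1\tprotA\t75.0\t40\t10\t0\t240\t300\t51\t90\t1e-10\t80",
    "gene1\tprotB\t30.0\t20\t14\t0\t900\t960\t1\t20\t5.0\t20",
]
blocks = merge_intervals(parse_hsps(example)["gene1"])
print(blocks, supported_fraction((201, 400), blocks))
```

An ORF with a low supported fraction would then be a candidate for extending into an intron or for being spurious, while a short ORF with high support could be kept regardless of its length.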
Thank you very much for any help or suggestion!
Thank you very much for the help! Now I see how I can build some kind of analysis pipeline based on what you wrote. It is always more complicated to deal with coding regions in gDNA from a "non-model" species (i.e. no reference genome). Cheers!