When designing oligos for a microarray from ESTs, it seems to be crucial to choose the correct direction (strand) for the oligos, but I can't find anything in the literature on this or on how to do it. (I've written a small tool to conditionally reverse-complement ESTs using a dynamic programming algorithm that takes into account BlastX hits, poly-A tails, etc., but I'm unsure how important this is for the results.)
I tried to rely on the strand annotation from dbEST, but soon realized it is often wrong. I think this is due to clones being inserted the wrong way into the vector, and I guess the rate at which this happens depends on the kit being used. This also means that although the researcher "knows" the orientation, she will often be wrong.
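For illustration, a much simpler heuristic than the dynamic programming approach mentioned above can be sketched as follows. It is not the tool described in the question. It only assumes that a trailing poly-A run suggests the sense strand, that a leading poly-T run suggests the antisense strand, and that the sign of a BlastX frame (if one is available) outweighs either. The function names, the `min_run` threshold, and the vote weights are all hypothetical choices.

```python
import re

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def guess_strand(seq, blastx_frame=None, min_run=10):
    """Guess EST orientation: '+' (sense), '-' (antisense) or '?' (unknown).

    Assumptions (illustrative only):
      * a run of >= min_run A's at the 3' end votes for '+'
      * a run of >= min_run T's at the 5' end votes for '-'
      * a BlastX frame (+1..+3 or -1..-3) counts double
    """
    s = seq.upper()
    score = 0
    if re.search(r"A{%d,}$" % min_run, s):
        score += 1
    if re.match(r"T{%d,}" % min_run, s):
        score -= 1
    if blastx_frame:  # None or 0 means "no protein hit"
        score += 2 if blastx_frame > 0 else -2
    if score > 0:
        return "+"
    if score < 0:
        return "-"
    return "?"


def orient(seq, blastx_frame=None, min_run=10):
    """Reverse-complement the EST only if the evidence says it is antisense."""
    if guess_strand(seq, blastx_frame, min_run) == "-":
        return reverse_complement(seq)
    return seq
```

ESTs that come back as `'?'` have no evidence either way, and it may be safer to leave them out of oligo design than to guess.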
You may also try constructing an rRNA library and using it with SeqClean.
2) Check your ESTs for common repeats. While some transposons are expressed, I am not sure you want pre-mRNA intronic sequences on the chip.
3) Did you try assembling all your ESTs?
4) You may also tblastx your ESTs against those from related species. There may not be any proteins at NCBI covering the less conserved parts of the proteins from your species of interest.
Thanks for bringing this up. I have had really poor results from the various EST cleaning tools; some old notes are at http://blog.malde.org/index.php/2008/05/08/cleaning-up-sequences/. Do you have any independent resources demonstrating the effectiveness of these tools?
written 14.1 years ago by Ketil (updated 5.2 years ago by Ram)
If the question is whether I created an artificial "EST" set with vector and ribosomal sequences thrown in, mutated it with point mutations and indels, flipped some sequences (all of which can be seen in real EST data), and then looked at the results, the answer is: not yet. I am mapping more than 3 million ESTs from various species to a novel genome. I mapped part of the EST sets without SeqClean pre-processing, so I hope to see the difference. GMAP, which I use for EST mapping, is often overeager to call a match (at the protein level) with UTRs, and possibly with other sequences, so fewer dubious ends should give fewer nonsense matches.