I'm working with an older Agilent microarray platform, and I need to update the annotation. The problem: given a list of 10,000 sequences of length 60, identify those which unambiguously map within the coding region or UTR of a well-annotated gene, and record the Entrez ID for that gene. By "unambiguous" I mean a single gene matches all 60 bases with 100% identity. Cases where gene FOO maps perfectly but gene BAR has 100% identity for 27 bases should be rejected as ambiguous. Since the query sequences were generated from ESTs, I need to accept results where there is 100% identity but the alignment spans exons of the same gene.
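To make the acceptance rule concrete, here is a minimal sketch of the filter in Python. It assumes the alignment hits have already been reduced to `(probe_id, gene_id, matched_bases, mismatches)` tuples. That layout and the function name `unambiguous_probes` are hypothetical, not any aligner's native format. It also assumes that any reported hit to a second gene, however short, makes the probe ambiguous.

```python
from collections import defaultdict

PROBE_LENGTH = 60  # probe length stated in the question


def unambiguous_probes(hits, probe_length=PROBE_LENGTH):
    """Return {probe_id: gene_id} for probes that hit exactly one gene, perfectly.

    `hits` is an iterable of (probe_id, gene_id, matched_bases, mismatches)
    tuples. This layout is an assumption made for illustration.
    """
    genes_hit = defaultdict(set)  # every gene a probe touched, even partially
    perfect = defaultdict(set)    # genes matched over the full probe length

    for probe_id, gene_id, matched, mismatches in hits:
        genes_hit[probe_id].add(gene_id)
        if matched == probe_length and mismatches == 0:
            perfect[probe_id].add(gene_id)

    result = {}
    for probe_id, genes in genes_hit.items():
        # Accept only if a single gene was hit at all and that hit is perfect.
        # Several perfect hits to the same gene (e.g. isoforms) are fine.
        if len(genes) == 1 and perfect.get(probe_id, set()) == genes:
            result[probe_id] = next(iter(genes))
    return result
```

With this rule, a probe that matches FOO over all 60 bases and BAR over 27 bases is dropped, while a probe that matches two transcripts of FOO perfectly is kept.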
The brute-force solution is to feed a local BLAT instance the hg19 build and the sequences, parse the output for start-stop loci, and match those against exon bounds for the whole genome pulled from UCSC. That's not a fun way to spend an afternoon.
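For reference, the brute-force route could look roughly like the sketch below. It assumes BLAT's tab-separated PSL output from an untranslated nucleotide alignment, where target block starts are on the plus strand. The `exons_by_gene` mapping (gene ID to chromosome and exon intervals, 0-based half-open) and the function names are hypothetical stand-ins for whatever is built from the UCSC tables.

```python
def parse_psl_line(line):
    """Extract the PSL fields needed to test for a perfect, exon-contained hit."""
    f = line.rstrip("\n").split("\t")
    block_sizes = [int(x) for x in f[18].rstrip(",").split(",")]
    t_starts = [int(x) for x in f[20].rstrip(",").split(",")]
    return {
        "matches": int(f[0]),
        "mismatches": int(f[1]),
        "rep_matches": int(f[2]),  # matches that fall in repeat-masked sequence
        "q_name": f[9],
        "q_size": int(f[10]),
        "t_name": f[13],
        # PSL coordinates are 0-based, half-open: (start, end) per aligned block
        "blocks": [(s, s + n) for s, n in zip(t_starts, block_sizes)],
    }


def is_perfect(hit):
    """True when every query base aligns as a match (gaps in the target are allowed)."""
    return hit["matches"] + hit["rep_matches"] == hit["q_size"] and hit["mismatches"] == 0


def genes_containing(hit, exons_by_gene):
    """Return gene IDs for which every aligned block lies inside one of the gene's exons.

    `exons_by_gene` maps gene_id -> (chrom, [(exon_start, exon_end), ...]);
    this structure is an assumption made for illustration.
    """
    genes = []
    for gene_id, (chrom, exons) in exons_by_gene.items():
        if chrom != hit["t_name"]:
            continue
        if all(any(es <= bs and be <= ee for es, ee in exons)
               for bs, be in hit["blocks"]):
            genes.append(gene_id)
    return genes
```

The linear scan over all genes is only meant to show the logic. For 10,000 probes against a whole-genome exon table, an interval index or a sorted-coordinate sweep would be the more practical choice.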
Can you think of a method that requires less effort?
Is there a reason not to align to mRNA?
Good point: I only care about perfect matches to mRNA, ideally RefSeq.