Pipeline To Map 60-Mers To Genes
13.2 years ago

I'm working with an older Agilent microarray platform, and I need to update the annotation. The problem: given a list of 10,000 sequences of length 60, identify those which unambiguously map within the coding region or UTR of a well-annotated gene, and record the Entrez ID for that gene. By "unambiguous" I mean a single gene matches all 60 bases with 100% identity. Cases where gene FOO maps perfectly but gene BAR has 100% identity for 27 bases should be rejected as ambiguous. Since the query sequences were generated from ESTs, I need to accept results where there is 100% identity but the alignment spans exons of the same gene.

The brute force solution is to feed a local BLAT instance the hg19 build and the sequences, parse the output for start-stop loci, and match those against exon bounds for the whole genome pulled from UCSC. That's not a fun way to spend an afternoon.

Can you think of a method that requires less effort?

annotation blat microarray • 3.1k views

Is there a reason not to align to mRNA?


Good point, since I only care about perfect matches to mRNA, ideally RefSeq.

13.2 years ago

"An afternoon?" Ah ha ha ha! :-) Seriously, it's a bigger, uglier can of worms than you'd expect.

I found the Agilent 4x44k human oligoarray has updated annotation on the GEO platform (April 2011) but not on the Agilent website. You may want to check GEO to see if the annotations are updated sufficiently for your uses before you decide to embark on a potentially perilous journey...

My suggestion is to look at a pipeline designed for this purpose. For a comparison of such pipelines, see for example http://www.biomedcentral.com/1753-6561/3/S4/S1 (there are other reviews, of course, but that one's pretty good).

I found sigReannot to do pretty well, providing enough extra info that you can spend your time ranking heuristics rather than mapping, then re-mapping, then mapping again (with successively more permissive search spaces.)

Mapping to mRNAs sounds great, and it is great for probes that align to annotated mRNAs, but a ton of probes don't align to any mRNA: some sit just downstream of an annotated gene, and others are "in the middle of nowhere." I've searched around for updated annotation sets for these types of arrays (Agilent, for example), and they're oddly non-existent. I figure the reason is that nobody wants to put potentially incorrect annotations out there.


"An afternoon" was a little bioinformatics humor, in the same category as "I'll do that in my copious amounts of free time". Thanks for the article; I'll look at sigReannot. It may help me that I'm uninterested in marginal alignments; I only want really clear cases.

13.2 years ago

A simple workflow might look like:

  1. Align to RefSeq using blat
  2. Use pslReps to choose the single best hit
  3. Use a simple Perl, Python, or even awk script to choose only alignments that meet your criteria.
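
As a rough illustration of step 3, here is a minimal Python sketch that reads PSL lines (the tab-separated format BLAT writes) and keeps probes with a full-length, mismatch-free, ungapped hit to exactly one gene. The function names and the `transcript_to_gene` lookup (RefSeq accession to Entrez ID) are assumptions made for illustration; you would have to build that table yourself. It treats any additional reported hit to a different gene as ambiguous, following the definition in the question.

```python
from collections import defaultdict

PSL_COLUMNS = 21  # number of tab-separated fields in a PSL record


def parse_psl_line(line):
    """Return the PSL fields used below as a dict, or None for header/blank lines."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != PSL_COLUMNS or not fields[0].isdigit():
        return None
    return {
        "matches": int(fields[0]),
        "mismatches": int(fields[1]),
        "rep_matches": int(fields[2]),
        "q_inserts": int(fields[4]),   # number of gaps in the probe
        "t_inserts": int(fields[6]),   # number of gaps in the transcript
        "probe": fields[9],
        "q_size": int(fields[10]),
        "transcript": fields[13],
    }


def is_perfect(hit):
    """True if the whole probe aligns with no mismatches and no gaps."""
    covered = hit["matches"] + hit["rep_matches"]
    return (covered == hit["q_size"] and hit["mismatches"] == 0
            and hit["q_inserts"] == 0 and hit["t_inserts"] == 0)


def unambiguous_probes(psl_lines, transcript_to_gene):
    """Map probe name -> gene for probes whose reported hits all fall in one gene,
    with at least one of those hits being perfect.

    transcript_to_gene is a hypothetical dict you supply yourself.
    """
    genes_hit = defaultdict(set)      # every gene with any reported hit
    genes_perfect = defaultdict(set)  # genes with a perfect hit
    for line in psl_lines:
        hit = parse_psl_line(line)
        if hit is None:
            continue
        # Unknown transcripts count as their own target, so they trigger ambiguity.
        gene = transcript_to_gene.get(hit["transcript"], hit["transcript"])
        genes_hit[hit["probe"]].add(gene)
        if is_perfect(hit):
            genes_perfect[hit["probe"]].add(gene)
    return {
        probe: next(iter(genes))
        for probe, genes in genes_hit.items()
        if len(genes) == 1 and genes == genes_perfect[probe]
    }
```

Note that this must run on the unfiltered BLAT output if you want partial hits to other genes to count against a probe; after reducing to a single best hit, those secondary hits are gone. BLAT also only reports alignments above its minimum score, so very short partial matches may never appear in the output at all.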
13.2 years ago
Eric Fournier ★ 1.4k

I had to do something exceedingly similar with another organism. BLATing probes to the whole set of RefSeq sequences for my organism was the only sensible solution I came up with. Whatever else I tried to do ended up being a huge time sink with only marginal benefits.

Also, be careful about limiting your search space to coding regions. A lot of Agilent probes are designed to hybridize to the 3'UTR of genes, which has the advantage of removing a lot of dT-primer amplification bias.

As an aside, unless you have compelling reasons to do so, do not limit your annotations to alignments with 100% identity on 100% of the length. Agilent 60-mers can have significant hybridization with up to four mismatches, depending on the location of those mismatches within the probe. I personally use a cutoff of 58 matches for annotating probes, with an additional cutoff of 56 matches to determine specificity (i.e., if a probe has a 60 nt match with one transcript, but a 56 nt match with another transcript, I discard the probe as non-specific).
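
To make the two cutoffs concrete, here is a small Python sketch of that rule. The function name and the input format (a list of `(gene, matches)` pairs per probe) are assumptions for illustration, and grouping hits by gene rather than by individual transcript is also an assumption, made so that isoforms of the same gene do not count against each other.

```python
ANNOTATE_MIN = 58     # minimum matches (out of 60) to annotate a probe
SPECIFICITY_MIN = 56  # a second gene at or above this makes the probe non-specific


def annotate_probe(hits):
    """Return the gene for one probe, or None if it is unannotated or non-specific.

    hits: list of (gene, matches) pairs, one per alignment of this probe.
    """
    # Keep only the best match count seen for each gene.
    best_per_gene = {}
    for gene, matches in hits:
        best_per_gene[gene] = max(matches, best_per_gene.get(gene, 0))

    candidates = [g for g, m in best_per_gene.items() if m >= ANNOTATE_MIN]
    near_hits = [g for g, m in best_per_gene.items() if m >= SPECIFICITY_MIN]

    # Exactly one gene may clear the specificity cutoff, and it must also
    # clear the annotation cutoff.
    if len(candidates) == 1 and len(near_hits) == 1:
        return candidates[0]
    return None
```

With these cutoffs, `annotate_probe([("FOO", 60), ("BAR", 56)])` discards the probe, while `annotate_probe([("FOO", 60), ("BAR", 55)])` assigns it to `FOO`.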


You're right; I don't mean to exclude the 3-prime UTR. I've corrected the question to indicate that.
