I'm sure that there is a clever way to do this using MySQL queries to the UCSC genome database tables, probably in conjunction with Galaxy. Here's my less-clever way:
- Follow the instructions in this answer to obtain a BED file with the introns; we'll call it introns.txt
- Go to the Affymetrix page for exon arrays and log in (create an account if required)
- Scroll down to "Archived NetAffx Annotation Files" and download the appropriate probeset annotation file for HG18; it's named HuEx-1_0-st-v2.na29.hg18.probeset.csv.zip
Next: unzip the exon annotation file and extract just the relevant fields:
unzip HuEx-1_0-st-v2.na29.hg18.probeset.csv.zip
grep -v "^#" HuEx-1_0-st-v2.na29.hg18.probeset.csv |
awk 'BEGIN {FS=","; OFS=","} {print $1,$2,$3,$4,$5,$6,$7}' > huex.csv
head -3 huex.csv # first couple of lines
"probeset_id","seqname","strand","start","stop","probe_count","transcript_cluster_id"
"2315101","chr1","+","1788","2030","4","2315100"
"2315102","chr1","+","2520","2555","4","2315100"
Now, read those files into R and find overlaps using the GenomicRanges package.
library(GenomicRanges)
introns <- read.table("introns.txt", header = FALSE, stringsAsFactors = FALSE)
colnames(introns) <- c("chr", "start", "end", "name", "score", "strand")
huex <- read.table("huex.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE)
# make GRanges objects
introns.gr <- GRanges(seqnames = Rle(introns$chr),
                      ranges = IRanges(start = introns$start,
                                       end = introns$end,
                                       names = introns$name),
                      strand = Rle(introns$strand))
huex.gr <- GRanges(seqnames = Rle(huex$seqname),
                   ranges = IRanges(start = huex$start,
                                    end = huex$stop,
                                    names = huex$probeset_id),
                   strand = Rle(huex$strand))
# find probesets wholly within introns
ov <- findOverlaps(huex.gr, introns.gr, type = "within")
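The matchMatrix() method returns the matching row indices for query (probesets) and subject (introns); in more recent versions of GenomicRanges, where matchMatrix() is no longer available, as.matrix(ov) returns the equivalent matrix. Here's what I got back for chromosome 21: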
head(matchMatrix(ov))
#      query subject
# [1,] 802861 140
# [2,] 802866 144
# [3,] 802867 144
# [4,] 802870 142
# [5,] 802870 145
# [6,] 802871 142
And you can confirm that selected probesets are within introns:
c(introns[140,]$start, introns[140,]$end)
# [1] 13332616 13336693
c(huex[802861,]$start, huex[802861,]$stop)
# [1] 13332617 13332722
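As a rough cross-check outside R, the same "wholly within" test can be sketched with awk on a toy pair of files. This is only an illustration under stated assumptions: the probeset IDs P1 and P2, the intron name intron_140, its "+" strand and the second probeset row are made up, and only the chr21 coordinates come from the output above. It also assumes that the Affymetrix coordinates are 1-based while BED starts are 0-based, so 1 is added to the intron start before comparing.

```shell
# Toy inputs in a scratch directory (hypothetical rows, see the note above).
tmp=$(mktemp -d)

# BED columns: chr, start (0-based), end, name, score, strand
printf 'chr21\t13332616\t13336693\tintron_140\t0\t+\n' > "$tmp/introns.txt"

# Same column layout as huex.csv; P1 and P2 are made-up probeset IDs.
cat > "$tmp/huex.csv" <<'EOF'
"probeset_id","seqname","strand","start","stop","probe_count","transcript_cluster_id"
"P1","chr21","+","13332617","13332722","4","0"
"P2","chr21","+","13336600","13336800","4","0"
EOF

result=$(awk -F'\t' '
  # First file: store every intron, shifting the BED start to 1-based.
  NR == FNR {
    chr[FNR] = $1; s[FNR] = $2 + 1; e[FNR] = $3 + 0
    name[FNR] = $4; strand[FNR] = $6; n = FNR
    next
  }
  FNR == 1 { next }              # skip the CSV header line
  {
    gsub(/"/, "")                # drop the quotes around each CSV field
    split($0, f, ",")
    # Report the probeset if it lies wholly inside an intron on the
    # same chromosome and strand (a "*" strand is not handled here).
    for (i = 1; i <= n; i++)
      if (f[2] == chr[i] && f[3] == strand[i] &&
          (f[4] + 0) >= s[i] && (f[5] + 0) <= e[i])
        print f[1], name[i]
  }' "$tmp/introns.txt" "$tmp/huex.csv")

echo "$result"
rm -rf "$tmp"
```

Only P1 is reported, because P2 runs past the end of the intron. The nested loop is fine for a spot check on a few rows, but for the full annotation the interval-based findOverlaps() approach is the better choice.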
All of this assumes that you mean introns as defined by UCSC HG18 genes.