One way to do this is to use BEDOPS psl2bed to convert UCSC BLAT output to UCSC BED, then use BEDOPS bedmap to list BED-formatted gene annotations associated with the BED coordinates of your BLAT search results.
For example, let's say you are working with human data and you want GENCODE v16 annotations. Here's how you would put them into sorted BED format with BEDOPS gtf2bed:
$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_16/gencode.v16.annotation.gtf.gz \
| gunzip -c - \
| gtf2bed - \
| grep -w gene - \
> gencode.v16.genes.bed
Then, you could build a list of unique gene IDs that map to each BLAT search result with the following bedmap
command. Let's assume your BLAT results are in a file called searchResults.psl
:
$ psl2bed < searchResults.psl \
| bedmap --echo --echo-map-id-uniq - gencode.v16.genes.bed \
> answer.viaBedmap.bed
The file answer.viaBedmap.bed
is a sorted BED file with results in the following layout:
[ BLAT result 1 ] | [ gene 1-1 ] ; [ gene 1-2 ] ; ... ; [ gene 1-i ]
...
[ BLAT result N ] | [ gene N-1 ] ; [ gene N-2 ] ; ... ; [ gene N-j ]
Each [ BLAT result n ]
is a BED element from the BLAT result. The pipe character delimits this element from a semi-colon-delimited list of unique GENCODE gene IDs (gene_name
fields from the original GTF) that map to the BLAT element.
If you want the entire gene element, replace --echo-map-id-uniq
with --echo-map
. There are other --echo-map-*
operators available; the documentation explains their usage.
The default mapping criterion is one or more bases of overlap between the BLAT and GENCODE elements. You can change this criteria to be more stringent with the appropriate bedmap
option.
If you don't need to know how the gene IDs associate with each BLAT result — you just want the gene IDs that overlap the BLAT result population — a second way to do this is to instead use BEDOPS bedops --element-of
:
$ psl2bed < searchResults.psl \
| bedops --element-of -1 gencode.v16.genes.bed - \
> answer.viaBedops.bed
This will just return BED-formatted GENCODE gene elements that overlap the BLAT coordinates by one or more bases. You don't get the per-element association information with a bedops
-based approach, but it will run faster than bedmap
.
Which tool to use between bedmap
and bedops
will likely depend on what question you are trying to answer.
Also, the answer.*.bed
files calculated with either approach are sorted, which means that they are ready to use with other downstream BEDOPS-based analyses. This means that you can do more fast and memory efficient set or mapping operations with this result and BEDOPS tools, and no further sorting is necessary.
If you have the coordinates of the alignment maybe Intersectbed could help.