How To Map Chromosome Coordinates To Ncbi Refseq
2
1
Entering edit mode
11.6 years ago
pld 5.1k

I recently used BLAT to generate a large number of alignments but the input files were on the chromosome level, not individual genes. I realize BLAT and filtering the subject by taxonomical ID is an option, but BLAT seems to be a stricter option. The alignments are 60bp with no gaps. What would be the best way to find the gene associated with each of these regions, if it exists?

genbank refseq • 3.5k views
ADD COMMENT
0
Entering edit mode

If you have the coordinates of the alignment maybe Intersectbed could help.

ADD REPLY
2
Entering edit mode
11.6 years ago
Duarte Molha ▴ 240

As @AndreiR as said...

Download a bed file of all the genes of interest (from the ucsc table browser you can download ready-made bed files)

then make a bed file from your alignments:

then install bedtools and:

intersectBed -a alignments.bed -b refseq.bed -wo > intersections.txt

This should give you all the info you need.

ADD COMMENT
0
Entering edit mode
11.6 years ago

One way to do this is to use BEDOPS psl2bed to convert UCSC BLAT output to UCSC BED, then use BEDOPS bedmap to list BED-formatted gene annotations associated with the BED coordinates of your BLAT search results.

For example, let's say you are working with human data and you want GENCODE v16 annotations. Here's how you would put them into sorted BED format with BEDOPS gtf2bed:

$ wget -O - ftp://ftp.sanger.ac.uk/pub/gencode/release_16/gencode.v16.annotation.gtf.gz \
    | gunzip -c - \
    | gtf2bed - \
    | grep -w gene - \
    > gencode.v16.genes.bed

Then, you could build a list of unique gene IDs that map to each BLAT search result with the following bedmap command. Let's assume your BLAT results are in a file called searchResults.psl:

$ psl2bed < searchResults.psl \
    | bedmap --echo --echo-map-id-uniq - gencode.v16.genes.bed \
    > answer.viaBedmap.bed

The file answer.viaBedmap.bed is a sorted BED file with results in the following layout:

[ BLAT result 1 ] | [ gene 1-1 ] ; [ gene 1-2 ] ; ... ; [ gene 1-i ]
...
[ BLAT result N ] | [ gene N-1 ] ; [ gene N-2 ] ; ... ; [ gene N-j ]

Each [ BLAT result n ] is a BED element from the BLAT result. The pipe character delimits this element from a semi-colon-delimited list of unique GENCODE gene IDs (gene_name fields from the original GTF) that map to the BLAT element.

If you want the entire gene element, replace --echo-map-id-uniq with --echo-map. There are other --echo-map-* operators available; the documentation explains their usage.

The default mapping criterion is one or more bases of overlap between the BLAT and GENCODE elements. You can change this criteria to be more stringent with the appropriate bedmap option.

If you don't need to know how the gene IDs associate with each BLAT result — you just want the gene IDs that overlap the BLAT result population — a second way to do this is to instead use BEDOPS bedops --element-of:

$ psl2bed < searchResults.psl \
    | bedops --element-of -1 gencode.v16.genes.bed - \
    > answer.viaBedops.bed

This will just return BED-formatted GENCODE gene elements that overlap the BLAT coordinates by one or more bases. You don't get the per-element association information with a bedops-based approach, but it will run faster than bedmap.

Which tool to use between bedmap and bedops will likely depend on what question you are trying to answer.

Also, the answer.*.bed files calculated with either approach are sorted, which means that they are ready to use with other downstream BEDOPS-based analyses. This means that you can do more fast and memory efficient set or mapping operations with this result and BEDOPS tools, and no further sorting is necessary.

ADD COMMENT

Login before adding your answer.

Traffic: 1621 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6