This answer assumes reference genome hg19. Adjust for your work, as needed. This answer also assumes you have installed the BEDOPS toolkit, including sort-bed and bedmap.
First, generate a BED file of HGNC names and genomic positions:
$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -N -e "SELECT k.chrom, kg.txStart, kg.txEnd, x.geneSymbol FROM knownCanonical k, knownGene kg, kgXref x WHERE k.transcript = x.kgID AND k.transcript = kg.name" hg19 | sort-bed - > genes.bed
For example:
$ grep CDK11B genes.bed
chr1 1571099 1655775 CDK11B
Next, generate a BED file of cytobands:
$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz | gunzip -c | sort-bed - > cytoBand.bed
Finally, map HGNC intervals to cytobands with BEDOPS bedmap --echo-map-id:
$ bedmap --echo --echo-map-id --delim '\t' genes.bed cytoBand.bed > answer.bed
The file answer.bed gives a mapping of gene names to cytobands:
$ head answer.bed
chr1 11873 14409 DDX11L1 p36.33
chr1 14361 19759 WASH7P p36.33
chr1 14406 29370 WASH7P p36.33
chr1 34610 36081 FAM138F p36.33
chr1 69090 70008 OR4F5 p36.33
chr1 134772 140566 LOC729737 p36.33
chr1 321083 321115 DQ597235 p36.33
chr1 321145 321207 DQ599768 p36.33
chr1 322036 326938 LOC100133331 p36.33
chr1 327545 328439 LOC388312 p36.33
To return to your example:
$ grep CDK11B answer.bed
chr1 1571099 1655775 CDK11B p36.33
If you want an answer formatted like 1p36.33, you can use awk to combine the chromosome name (chr1) and the fifth field (p36.33) to build the answer as you need it:
$ awk '{ sub(/^chr/, "", $1); print $4"\t"$1$5; }' answer.bed > answer.txt
Then:
$ grep CDK11B answer.txt
CDK11B 1p36.33
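One caveat: a gene that straddles a band boundary overlaps more than one cytoband, and bedmap then reports all of the band names in the fifth field, joined by its default multi-value delimiter (a semicolon). The following is a minimal sketch of how the awk step could prefix the chromosome to each band in that case. The file names sample_answer.bed and sample_labels.txt, the gene name GENE_X, and its coordinates are made up for illustration.

```shell
# Hypothetical bedmap-style rows; GENE_X and its coordinates are invented
# to illustrate a gene that overlaps two cytobands.
printf 'chr1\t1571099\t1655775\tCDK11B\tp36.33\n'        >  sample_answer.bed
printf 'chr1\t2290000\t2310000\tGENE_X\tp36.33;p36.32\n' >> sample_answer.bed

awk -F'\t' -v OFS='\t' '{
    chrom = $1
    sub(/^chr/, "", chrom)             # chr1 -> 1
    n = split($5, bands, ";")          # one entry per overlapped cytoband
    label = ""
    for (i = 1; i <= n; i++)           # prefix the chromosome to every band
        label = label (i > 1 ? ";" : "") chrom bands[i]
    print $4, label
}' sample_answer.bed > sample_labels.txt

cat sample_labels.txt
```

Genes that overlap a single band come out exactly as before (CDK11B, 1p36.33); multi-band genes keep every band rather than producing a label such as 1p36.33;p36.32.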
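If you have a text file of 400 gene names (or however many), one name per line, you can grep this file with the -f option. Adding -F treats each name as a fixed string rather than a regular expression, and -w matches whole words only, so that a name such as CDK1 does not also pull in CDK11B:
$ grep -w -F -f geneNames.txt answer.txt > filteredAnswer.txt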
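Pattern-based matching with grep can still pick up unintended lines, for instance when a requested name appears inside a longer hyphenated symbol. If that matters, an alternative is to compare the first column exactly. This is a minimal sketch under the assumption that the gene list has one name per line and the answer file is tab-delimited; the sample_* file names and contents are placeholders for illustration.

```shell
# Hypothetical sample inputs, for illustration only.
printf 'CDK11B\t1p36.33\nCDK11A\t1p36.33\nOR4F5\t1p36.33\n' > sample_answer.txt
printf 'CDK11B\nOR4F5\n' > sample_geneNames.txt

# First file (NR == FNR): load the wanted names into a lookup table.
# Second file: print only lines whose first field is exactly a wanted name.
awk -F'\t' 'NR == FNR { wanted[$1] = 1; next } ($1 in wanted)' \
    sample_geneNames.txt sample_answer.txt > sample_filtered.txt

cat sample_filtered.txt
```

With the real files, the same awk command would take geneNames.txt and answer.txt as its two arguments. Because the comparison is on the whole first field, CDK11A is not returned when only CDK11B is requested.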