Question

Mapping chromosome location to gene ID

0

Entering edit mode

3.1 years ago

asente ▴ 30

Dear All,

I downloaded a set of bed files containing eCLIP data from the Encode project. The file looks like this:

chr7    77166811    77166847    CSTF2T_K562_rep02   1000    +   5.28146422361672    46.6734054831252    -1  -1
chr7    106809937   106810000   CSTF2T_K562_rep02   1000    +   6.14290339308374    44.838829122336 -1  -1
chr7    64499317    64499409    CSTF2T_K562_rep02   1000    +   6.05907180228027    41.9978265920308    -1  -1
chr7    77166847    77166915    CSTF2T_K562_rep02   1000    +   5.00034936315757    39.3525779060187    -1  -1
chr7    77166598    77166662    CSTF2T_K562_rep02   1000    +   4.75856269102781    32.8986654267798    -1  -1
chr7    158703710   158703824   CSTF2T_K562_rep02   1000    +   4.82533170324052    32.8087161400284    -1  -1

I am interested to find the gene ID (e.g., ensembl ENSG, ENST or Uniprot ID) for each of the chromosome locations (defined with start and stop positions in columns 2 and 3). I know how to achieve this using R biomaRt package but this seems quite slow when I have millions of entries to map. Does anyone have a suggestion for a better and faster solution?

Many thanks for all your suggestions.

With best wishes, Andrija

bedtools eclip ensembl bed R • 1.7k views

ADD COMMENT • link updated 2.4 years ago by Ram 45k • written 3.1 years ago by asente ▴ 30

2

Entering edit mode

3.1 years ago

Pierre Lindenbaum 166k

download all the genes positions and get the intersection with bedtools intersect

ADD COMMENT • link 3.1 years ago by Pierre Lindenbaum 166k

score 2 · Accepted Answer · 2022-06-10

Given files eCLIP.bed and genes.bed containing regions of interest and genes, resp.:

bedmap --echo --echo-map-id --delim '\t' eCLIP.bed genes.bed > answer.bed

The list of overlapping gene IDs will be in the last column of answer.bed.

To get genes, you could use Gencode:

wget -qO- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > genes.bed

Gencode follows UCSC chromosome naming convention, so you can use genes.bed directly for bedmap operations.

Refs.: 1) https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html 2) https://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/convert2bed.html#