Mapping chromosome location to gene ID
2
0
Entering edit mode
2.5 years ago
asente ▴ 30

Dear All,

I downloaded a set of bed files containing eCLIP data from the Encode project. The file looks like this:

chr7    77166811    77166847    CSTF2T_K562_rep02   1000    +   5.28146422361672    46.6734054831252    -1  -1
chr7    106809937   106810000   CSTF2T_K562_rep02   1000    +   6.14290339308374    44.838829122336 -1  -1
chr7    64499317    64499409    CSTF2T_K562_rep02   1000    +   6.05907180228027    41.9978265920308    -1  -1
chr7    77166847    77166915    CSTF2T_K562_rep02   1000    +   5.00034936315757    39.3525779060187    -1  -1
chr7    77166598    77166662    CSTF2T_K562_rep02   1000    +   4.75856269102781    32.8986654267798    -1  -1
chr7    158703710   158703824   CSTF2T_K562_rep02   1000    +   4.82533170324052    32.8087161400284    -1  -1

I am interested to find the gene ID (e.g., ensembl ENSG, ENST or Uniprot ID) for each of the chromosome locations (defined with start and stop positions in columns 2 and 3). I know how to achieve this using R biomaRt package but this seems quite slow when I have millions of entries to map. Does anyone have a suggestion for a better and faster solution?

Many thanks for all your suggestions.

With best wishes, Andrija

bedtools eclip ensembl bed R • 1.3k views
ADD COMMENT
2
Entering edit mode
2.5 years ago

Given files eCLIP.bed and genes.bed containing regions of interest and genes, resp.:

bedmap --echo --echo-map-id --delim '\t' eCLIP.bed genes.bed > answer.bed

The list of overlapping gene IDs will be in the last column of answer.bed.

To get genes, you could use Gencode:

wget -qO- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_40/gencode.v40.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    > genes.bed

Gencode follows UCSC chromosome naming convention, so you can use genes.bed directly for bedmap operations.

Refs.: 1) https://bedops.readthedocs.io/en/latest/content/reference/statistics/bedmap.html 2) https://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/convert2bed.html#

ADD COMMENT
1
Entering edit mode

Dear Alex,

Thank you for your help. Your answer and the functions are brilliant!

Best wishes, Andrija

ADD REPLY
2
Entering edit mode
2.5 years ago

download all the genes positions and get the intersection with bedtools intersect

ADD COMMENT

Login before adding your answer.

Traffic: 1809 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6