Dear All,
I downloaded a set of BED files containing eCLIP data from the ENCODE project. Each file looks like this:
chr7 77166811 77166847 CSTF2T_K562_rep02 1000 + 5.28146422361672 46.6734054831252 -1 -1
chr7 106809937 106810000 CSTF2T_K562_rep02 1000 + 6.14290339308374 44.838829122336 -1 -1
chr7 64499317 64499409 CSTF2T_K562_rep02 1000 + 6.05907180228027 41.9978265920308 -1 -1
chr7 77166847 77166915 CSTF2T_K562_rep02 1000 + 5.00034936315757 39.3525779060187 -1 -1
chr7 77166598 77166662 CSTF2T_K562_rep02 1000 + 4.75856269102781 32.8986654267798 -1 -1
chr7 158703710 158703824 CSTF2T_K562_rep02 1000 + 4.82533170324052 32.8087161400284 -1 -1
I would like to find the gene ID (e.g., Ensembl ENSG or ENST, or UniProt ID) for each genomic region (defined by the start and end positions in columns 2 and 3). I know how to do this with the R biomaRt package, but it is quite slow when I have millions of entries to map. Does anyone have a suggestion for a better and faster solution?
Many thanks for all your suggestions.
With best wishes, Andrija
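One way to avoid per-region queries is to do the overlap locally against a gene annotation table that has been downloaded once (for example, gene coordinates extracted from a GTF file). Dedicated tools such as bedtools intersect are built for this at scale; the Python sketch below only illustrates the idea with a simple bin index. The function names, the bin size, and the toy gene records (GENE_A, GENE_B, GENE_C) are assumptions made for illustration and are not real Ensembl coordinates. The two peaks are taken from the example above.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # assumed bin width; tune it to typical gene lengths


def build_index(genes, bin_size=BIN_SIZE):
    """Index gene records (chrom, start, end, strand, gene_id) by genomic bin."""
    index = defaultdict(list)
    for gene in genes:
        chrom, start, end, _strand, _gene_id = gene
        # BED-style half-open interval [start, end): last covered base is end - 1
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            index[(chrom, b)].append(gene)
    return index


def overlapping_genes(index, chrom, start, end, strand=None, bin_size=BIN_SIZE):
    """Return sorted IDs of genes overlapping [start, end) on chrom.

    If strand is given, only genes on that strand are reported
    (eCLIP peaks are strand-specific).
    """
    hits = set()
    for b in range(start // bin_size, (end - 1) // bin_size + 1):
        for _chrom, g_start, g_end, g_strand, gene_id in index.get((chrom, b), ()):
            same_strand = strand is None or strand == g_strand
            if g_start < end and start < g_end and same_strand:
                hits.add(gene_id)
    return sorted(hits)


def annotate_bed_lines(lines, index):
    """Yield (chrom, start, end, strand, gene_ids) for each BED line."""
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, start, end, strand = fields[0], int(fields[1]), int(fields[2]), fields[5]
        yield chrom, start, end, strand, overlapping_genes(index, chrom, start, end, strand)


# Hypothetical gene table; real coordinates would come from an annotation file.
toy_genes = [
    ("chr7", 77_100_000, 77_200_000, "+", "GENE_A"),
    ("chr7", 77_150_000, 77_170_000, "-", "GENE_B"),
    ("chr7", 106_800_000, 106_900_000, "+", "GENE_C"),
]
index = build_index(toy_genes)

peaks = [
    "chr7 77166811 77166847 CSTF2T_K562_rep02 1000 + 5.28146422361672 46.6734054831252 -1 -1",
    "chr7 106809937 106810000 CSTF2T_K562_rep02 1000 + 6.14290339308374 44.838829122336 -1 -1",
]
results = list(annotate_bed_lines(peaks, index))
for row in results:
    print(*row, sep="\t")
```

Because the gene table is indexed once and each peak only inspects the bins it touches, the cost per peak stays small even for millions of entries. Mapping the resulting gene IDs to UniProt IDs would still need a separate ID-mapping table.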
Dear Alex,
Thank you for your help. Your answer and the functions are brilliant!
Best wishes, Andrija