Efficient way to add annotation to a list of genome positions with python
2
1
Entering edit mode
10.2 years ago
juncheng ▴ 220

Dear all,

I have a list of genome positions, like this (thousands of ):

chr1 7860756 7973866
chr1 7973763 7976860
chr3 8155768 8156554
chr5 8729338 8729482
chr5 8846800 8860030
chr6 8848105 8848254
chr6 11904306 11904806
chr7 11904687 11904840
chr8 11904718 11904836
chr10 11904792 11904841

I have a genome annotation file like this (repetitive region) 7 Million lines:

chr1    rn6_rmsk        exon    471     533     32.000000       +       .       gene_id "(ATAC)n"; transcript_id "(ATAC)n";
chr1    rn6_simpleRepeat        exon    480     510     53.000000       .       .       gene_id "trf"; transcript_id "trf";
chr1    rn6_rmsk        exon    717     862     588.000000      +       .       gene_id "RMER10B"; transcript_id "RMER10B";
chr1    rn6_rmsk        exon    2099    2452    627.000000      -       .       gene_id "MTD"; transcript_id "MTD_dup9";
chr1    rn6_rmsk        exon    2902    2936    15.000000       +       .       gene_id "(CTTC)n"; transcript_id "(CTTC)n";
chr1    rn6_rmsk        exon    3227    3251    26.000000       +       .       gene_id "(TGTC)n"; transcript_id "(TGTC)n";
chr1    rn6_simpleRepeat        exon    3227    3251    50.000000       .       .       gene_id "trf"; transcript_id "trf_dup1";
chr1    rn6_rmsk        exon    3252    3288    13.000000       +       .       gene_id "(ATCAC)n"; transcript_id "(ATCAC)n";
chr1    rn6_simpleRepeat        exon    3500    3538    53.000000       .       .       gene_id "trf"; transcript_id "trf_dup2";
chr1    rn6_rmsk        exon    3500    3539    25.000000       +       .       gene_id "(AC)n"; transcript_id "(AC)n_dup5";
chr1    rn6_rmsk        exon    3539    3599    46.000000       +       .       gene_id "(AG)n"; transcript_id "(AG)n";
chr1    rn6_simpleRepeat        exon    3539    3599    88.000000       .       .       gene_id "trf"; transcript_id "trf_dup3";
chr1    rn6_rmsk        exon    3602    4047    1096.000000     +       .       gene_id "RLTR20B1"; transcript_id

Basically I want to figure out whether region in the first file is in the repetitive regions denoted in the second files. I don't want to loop through the second file every time, is there any efficient ways to do this?

I know there is existing script to do this, but want to write this program my self.

Could anyone just give me a hint a algorithm or something?

Best,
Jun

genome python • 2.7k views
ADD COMMENT
3
Entering edit mode
10.2 years ago

If you want to do it within python, I think pybedtools is what you need. Example, untested: Print the intervals in a with an overlap in b:

import pybedtools

a= pybedtools.BedTool('a.bed')
b= pybedtools.BedTool('b.gff')
print(a.intersect(b, u= True))
  
ADD COMMENT
1
Entering edit mode

you don't need the loop to do this. that's what intersect does by default

ADD REPLY
0
Entering edit mode

Thanks! I edited my answer

ADD REPLY
0
Entering edit mode

Thank both of you!

ADD REPLY
1
Entering edit mode
10.2 years ago
lelle ▴ 830

If you want to implement this yourself, I think an intervall tree is the data structure of choice. Here is a blog post with an implementation and some discussion that I found rather useful.

ADD COMMENT
0
Entering edit mode

Thanks, nice!

ADD REPLY

Login before adding your answer.

Traffic: 1009 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6