Dear all,
I have a list of genome positions, like this (thousands of ):
chr1 7860756 7973866
chr1 7973763 7976860
chr3 8155768 8156554
chr5 8729338 8729482
chr5 8846800 8860030
chr6 8848105 8848254
chr6 11904306 11904806
chr7 11904687 11904840
chr8 11904718 11904836
chr10 11904792 11904841
I have a genome annotation file like this (repetitive region) 7 Million lines:
chr1 rn6_rmsk exon 471 533 32.000000 + . gene_id "(ATAC)n"; transcript_id "(ATAC)n";
chr1 rn6_simpleRepeat exon 480 510 53.000000 . . gene_id "trf"; transcript_id "trf";
chr1 rn6_rmsk exon 717 862 588.000000 + . gene_id "RMER10B"; transcript_id "RMER10B";
chr1 rn6_rmsk exon 2099 2452 627.000000 - . gene_id "MTD"; transcript_id "MTD_dup9";
chr1 rn6_rmsk exon 2902 2936 15.000000 + . gene_id "(CTTC)n"; transcript_id "(CTTC)n";
chr1 rn6_rmsk exon 3227 3251 26.000000 + . gene_id "(TGTC)n"; transcript_id "(TGTC)n";
chr1 rn6_simpleRepeat exon 3227 3251 50.000000 . . gene_id "trf"; transcript_id "trf_dup1";
chr1 rn6_rmsk exon 3252 3288 13.000000 + . gene_id "(ATCAC)n"; transcript_id "(ATCAC)n";
chr1 rn6_simpleRepeat exon 3500 3538 53.000000 . . gene_id "trf"; transcript_id "trf_dup2";
chr1 rn6_rmsk exon 3500 3539 25.000000 + . gene_id "(AC)n"; transcript_id "(AC)n_dup5";
chr1 rn6_rmsk exon 3539 3599 46.000000 + . gene_id "(AG)n"; transcript_id "(AG)n";
chr1 rn6_simpleRepeat exon 3539 3599 88.000000 . . gene_id "trf"; transcript_id "trf_dup3";
chr1 rn6_rmsk exon 3602 4047 1096.000000 + . gene_id "RLTR20B1"; transcript_id
Basically I want to figure out whether region in the first file is in the repetitive regions denoted in the second files. I don't want to loop through the second file every time, is there any efficient ways to do this?
I know there is existing script to do this, but want to write this program my self.
Could anyone just give me a hint a algorithm or something?
Best,
Jun
you don't need the loop to do this. that's what intersect does by default
Thanks! I edited my answer
Thank both of you!