Question

Finding the number of coordinates

6.3 years ago

hosin • 0

Dear Biostar members, I have two large files, each containing a single column of genomic coordinates. For each coordinate (interval) in the first file, I want to count how many coordinates in the second file fall within it. The result should have two columns: the first column holds the coordinates exactly as in the first file, and the second column holds the counts.

First file:
chr22:15273-141831
chr9:19992214-20053813
chr1:220511845-220946924
chr6:51386116-51758466
chr8:64017612-64288853
chr5:7523216-7614366
chr21:49691288-49764730



Second file: 
chr22:15273-132511
chr22:140223-141831
chr22:32345-122987
chr9:19992214-20033814
chr9:20012214-20053813
chr1:220511845-220748925
chr1:220615645-220846924
chr1:220615645-220946924
chr6:51386116-51459367
chr6:51386116-51758466
chr8:64017612-64177753
chr8:64277712-64288853
chr5:7523216-7534366
chr5:7544217-7554469
chr5:7554619-7554963
chr5:7600000-7614366
chr21:49691288-49764730

The result should be like:

chr22:15273-141831          3
chr9:19992214-20053813      2
chr1:220511845-220946924    3
chr6:51386116-51758466      2
chr8:64017612-64288853      2
chr5:7523216-7614366        4
chr21:49691288-49764730     1

Is there an easy way to do this in the Linux shell? Thanks.

genome • 1.1k views

Convert these files to BED format, e.g. using awk. Essentially, this means replacing : and - with \t and subtracting 1 from the start coordinate (BED starts are 0-based; see the BED format specification for details). Then use bedtools intersect, and have a look at its counting option (-c). Please try it out and come back if you run into problems.
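A minimal sketch of those two steps, using the first two regions from the question as toy input. The file names (first.txt, second.txt, first.bed, second.bed, counts.txt) and the helper to_bed are made up for illustration. It also assumes chromosome names contain no ":" or "-", and it keeps the original coordinate string as a fourth BED column so it can be printed next to the count.

```shell
# Toy input taken from the question (first two regions only).
printf '%s\n' 'chr22:15273-141831' 'chr9:19992214-20053813' > first.txt
printf '%s\n' 'chr22:15273-132511' 'chr22:140223-141831' 'chr22:32345-122987' \
  'chr9:19992214-20033814' 'chr9:20012214-20053813' > second.txt

# Hypothetical helper: chr:start-end -> tab-separated BED with a 0-based start.
# Trailing whitespace is stripped first; the original string is kept as column 4.
# Assumes chromosome names contain neither ':' nor '-'.
to_bed() {
  awk -F'[:-]' 'BEGIN { OFS = "\t" }
       { gsub(/[ \t\r]+$/, "") }
       NF == 3 { print $1, $2 - 1, $3, $0 }' "$1"
}

to_bed first.txt  > first.bed
to_bed second.txt > second.bed

# For each interval in first.bed, count the overlapping intervals in second.bed.
# With a 4-column -a file, -c appends the count as column 5.
if command -v bedtools >/dev/null 2>&1; then
  bedtools intersect -a first.bed -b second.bed -c | cut -f 4,5 > counts.txt
else
  echo "bedtools not found in PATH; only the BED files were written" >&2
fi
```

Note that -c counts any overlap, not only full containment. If only intervals lying entirely inside a region should be counted, the -F option of bedtools intersect (minimum overlap as a fraction of the -b interval) set to 1.0 is probably what is needed. For the example data in the question, every interval in the second file lies inside a region of the first, so both should give the same counts.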

6.3 years ago by ATpoint 89k