5.5 years ago
hosin
Dear Biostar members, I have two big files, each with a single column of genomic coordinates. For each coordinate in the first file, I want to count how many coordinates in the second file fall within it. The result should have two columns: the first containing the coordinates from the first file and the second containing the counts.
First file:
chr22:15273-141831
chr9:19992214-20053813
chr1:220511845-220946924
chr6:51386116-51758466
chr8:64017612-64288853
chr5:7523216-7614366
chr21:49691288-49764730
Second file:
chr22:15273-132511
chr22:140223-141831
chr22:32345-122987
chr9:19992214-20033814
chr9:20012214-20053813
chr1:220511845-220748925
chr1:220615645-220846924
chr1:220615645-220946924
chr6:51386116-51459367
chr6:51386116-51758466
chr8:64017612-64177753
chr8:64277712-64288853
chr5:7523216-7534366
chr5:7544217-7554469
chr5:7554619-7554963
chr5:7600000-7614366
chr21:49691288-49764730
The result should be like:
chr22:15273-141831 3
chr9:19992214-20053813 2
chr1:220511845-220946924 3
chr6:51386116-51758466 2
chr8:64017612-64288853 2
chr5:7523216-7614366 4
chr21:49691288-49764730 1
Is there an easy way to do this in the Linux shell? Thanks.
Convert these files to the BED format, e.g. using `awk` (essentially it is just replacing `:` and `-` with `\t` and subtracting 1 from the start coordinate; see the BED format specification for why that is), and then use `bedtools intersect`. Have a look at the counting (`-c`) option of `intersect`. Please try it out and come back in case of problems.
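The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the file names `first.txt` and `second.txt` and the helper `to_bed` are made up here, the sample lines are a small subset of the data in the question, chromosome names are assumed to contain no `:` or `-`, and `bedtools` is assumed to be on the `PATH`. The original coordinate string is carried along as a fourth BED column so that it can be printed back in the result.

```shell
#!/bin/sh
# Small sample inputs taken from the question (file names are assumptions).
printf '%s\n' 'chr22:15273-141831' 'chr9:19992214-20053813' > first.txt
printf '%s\n' 'chr22:15273-132511' 'chr22:140223-141831' 'chr22:32345-122987' \
    'chr9:19992214-20033814' 'chr9:20012214-20053813' > second.txt

# Hypothetical helper: turn "chr:start-end" lines into tab-separated BED.
# BED starts are 0-based, so 1 is subtracted from the start coordinate.
# The original string is kept as column 4 for the final report.
to_bed() {
    awk -F'[-:]' 'BEGIN { OFS = "\t" } NF == 3 { print $1, $2 - 1, $3, $0 }' "$1"
}

to_bed first.txt  > first.bed
to_bed second.txt > second.bed

# -c appends, to each interval in -a, the number of overlapping -b intervals.
# With four input columns in first.bed, the count lands in column 5.
if command -v bedtools >/dev/null 2>&1; then
    bedtools intersect -a first.bed -b second.bed -c | awk '{ print $4, $5 }'
else
    echo "bedtools not found on PATH" >&2
fi
```

For this subset the output should be `chr22:15273-141831 3` and `chr9:19992214-20053813 2`, matching the counts requested in the question. Note that `-c` counts any overlap, not only intervals lying completely inside the first-file interval; if strict containment is needed, adding `-F 1.0` (require the full length of each `-b` interval to overlap) may be worth trying.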