Hi,
I have a big file (~50M lines) containing paired genomic positions, one pair per line, in this format:
chrA posA chrB posB
I want to reduce this list by grouping paired positions that are close to each other. For example:
chr1 1000 chr8 5000
chr1 990 chr8 5030
chr1 1010 chr8 5010
chr5 500 chr10 1000
After processing, it becomes (the last column represents the number of lines supporting the paired position):
chr1 1000 chr8 5000 3
chr5 500 chr10 1000 1
Any ideas? My first idea was to use a Perl script with a hash table, but I'm a little concerned about the size of the list.
FYI: the file is sorted by chromosome and position.
thanks
I don't quite follow. Are you considering chr1 990, 1000 and 1010 as the same position in some sense?
I know they're not the same position, but they are quite close to each other. I might rephrase the question as grouping positions that fall in the same region.
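To make "same region" concrete, here is a minimal sketch of one possible reading, not a definitive answer. It assumes a tolerance `tol` (50 bp here, an arbitrary value chosen only so the example above groups as shown), assumes the file is at least roughly sorted by chrA and posA, and takes the first line of each group as its representative. The function name `cluster_pairs` is made up for illustration. Because only groups near the current position are kept in memory, the full 50M lines never need to sit in a hash table.

```python
from typing import Iterable, Iterator, List, Tuple

Cluster = Tuple[str, int, str, int, int]


def cluster_pairs(lines: Iterable[str], tol: int = 50) -> Iterator[Cluster]:
    """Stream over 'chrA posA chrB posB' lines and yield
    (chrA, posA, chrB, posB, count) for each group of nearby pairs.

    Assumptions (not from the original post): `tol` is the maximum distance
    allowed on both sides, and input is (roughly) sorted by chrA, posA.
    """
    # Groups that could still receive lines: [chrA, posA, chrB, posB, count]
    open_clusters: List[list] = []

    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chr_a, pos_a = fields[0], int(fields[1])
        chr_b, pos_b = fields[2], int(fields[3])

        # Emit groups that the sorted input has moved past.
        still_open = []
        for c in open_clusters:
            if c[0] != chr_a or pos_a - c[1] > tol:
                yield tuple(c)
            else:
                still_open.append(c)
        open_clusters = still_open

        # Join the first open group within tol on both sides, else start one.
        for c in open_clusters:
            if (c[2] == chr_b
                    and abs(pos_a - c[1]) <= tol
                    and abs(pos_b - c[3]) <= tol):
                c[4] += 1
                break
        else:
            open_clusters.append([chr_a, pos_a, chr_b, pos_b, 1])

    # Emit whatever is still open at end of input.
    for c in open_clusters:
        yield tuple(c)
```

To run it on a file, something like `for row in cluster_pairs(open("pairs.txt")): print(*row)` would do, where `pairs.txt` is a placeholder name. Comparing against the first line of each group (rather than its running mean or its latest member) is a design choice that keeps groups from drifting; a different definition of "region" would need a different rule.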