I have many files (~50) with the same format of four columns: chr, strand, coordinate, score
Each file has about 1 Million lines. Now I want to merge these files by coordinates (strand specific) while preserve the score value of each sample. So the final file will be: chr, strand, coord, scoreTotal, score1, score2, ...., scoreN
For example:
File001:
chr strand coord score
chr1 + 100 1
chr1 - 200 2
File002:
chr1 + 100 3
chr1 + 101 4
chr1 - 200 5
chr1 - 201 6
Merged_file001_file001:
chr strand coord scoreSum score1 score2
chr1 + 100 4 1 3
chr1 + 101 4 0 4
chr1 - 200 7 2 5
chr1 - 201 6 0 6
I'm thinking of using R package like dplyr to group by coordinates, but for so many files with so many lines, it may not be efficient. I'm also thinking of converting each file to BED format, and then use tool like multiIntersectBed to merge files, but it seems that it can not preserve the score column.
Is there any tool can do this merging quickly?
Thank you very much!
You can use "bedtools groupby" command. Show some example files, will behelpfulr to get precise answer.
Thank you, but I checked the "bedtools groupby" command, it seems that it is not suitable for my case.
Hello,
are the coordinates in every file the same or do they differ?
fin swimmer
I have added example tables. Coordinates in every file are not exactly the same.
50 million records isn't that big a data set. How many zeros are you expecting once you
spread()
you data into a matrix?There will not be many zeroes. Because coordinates in different files are largely the same with some difference.