I wonder how can I join multiple large .tsv files (over 20 GB each) by using two identifiers (columns) which I used to do using the full_join
function in dplyr
package for comparatively smaller files in R. All files are in bed format
All files are in bed format, where ref_f.bed will be mapped by the other files by col1 and col2.
#merging the files
dflist <- list(ref_f, file_1, file_2, file_3,....file_n)
Test_out <- dflist %>% reduce(full_join, by=c("chr", "start"))
head ref_f.bed
chr1 1 100 .
chr1 101 200 .
chr1 201 300 .
chr1 301 400 .
chr1 401 500 .
chr1 501 600 .
chr1 601 700 .
chr1 701 800 .
chr1 801 900 .
head file_1.bed
chr1 10101 10200 10158 10177 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10158 10177 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10164 10183 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10164 10183 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10175 10194 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10175 10194 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10175 10194 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10181 10200 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10181 10200 Scerevisiae-UniPROBE-Tbf1.UP00299
head file_2.bed
chr1 10101 10200 10158 10177 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10158 10177 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10164 10183 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10164 10183 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10175 10194 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10175 10194 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10175 10194 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10181 10200 Scerevisiae-UniPROBE-Tbf1.UP00299
chr1 10101 10200 10181 10200 Scerevisiae-UniPROBE-Tbf1.UP00299
head file_3.bed
chr1 29101 29200 29111 29124 Hsapiens-HOCOMOCOv10-AP2D_HUMAN.H10MO.D
chr1 29101 29200 29132 29152 Scerevisiae-UniPROBE-Cha4.UP00300
chr1 29101 29200 29116 29133 Scerevisiae-UniPROBE-Ecm22.UP00305
chr1 29101 29200 29126 29143 Scerevisiae-UniPROBE-Hal9.UP00315
chr1 29101 29200 29126 29143 Scerevisiae-UniPROBE-Hal9.UP00315
chr1 29101 29200 29106 29125 Scerevisiae-UniPROBE-Rpn4.UP00272
chr1 29101 29200 29110 29130 Scerevisiae-UniPROBE-Rsc30.UP00279
chr1 29101 29200 29112 29132 Scerevisiae-UniPROBE-Rsc30.UP00279
chr1 29101 29200 29133 29152 Scerevisiae-UniPROBE-Stp2.UP00338
It would be a great assistance if anyone kindly advised how to overcome this issue.
Thanks
Do not cross-post to multiple fora just because you want a quick response, it annoys people in all communities and in this case, given how big the overlap is between the two communities, it pisses off the same set of people twice.
Cross-post link: https://bioinformatics.stackexchange.com/questions/21196/