I want to merge two data frames (named 'data' and 'add') so that only the rows of the first data frame (data) are kept (a left join).
When I use
Merged <- merge(data, add, by.x = "DATAGene", by.y = "ADDGene", all.x = TRUE, all.y = FALSE)
I get more rows than I had in the first data frame.
dim(data)
[1] 21578 4
dim(add)
[1] 25778 2
dim(Merged)
[1] 21639 5
Why would this happen and is there a way to avoid it?
Both the DATAGene and ADDGene columns are character columns.
The data frame "add" looks like:
ADDGene V1
1 TSPAN6 51
2 TNMD 0
3 DPM1 114
4 SCYL3 9
5 C1orf112 1
...
87 SPPL2B 6
88 FAM214B 20
89 COPZ2 75
...
The data frame "data" looks like:
DATAGene V2 V3 V4
1 TSPAN6 294 778 595
2 TNMD 0 8 0
3 DPM1 354 311 696
4 SCYL3 86 94 134
5 C1orf112 147 268 263
...
87 FAM214B 415 115 156
88 COPZ2 82 13 12
89 PRKAR2B 1523 710 250
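It is quite possible that one of the data frames has duplicate gene names in its key column. merge() returns one row for every matching pair, so a gene that appears once in data but twice in add produces two rows in the result.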
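A minimal sketch of how to check for this, using small made-up data frames (the values below are hypothetical and not taken from the question; only the column names DATAGene and ADDGene are). It counts repeated keys with duplicated() and shows that a single repeated gene in add is enough to make the merged result longer than data:

```r
# Toy stand-ins for 'data' and 'add' (hypothetical values for illustration)
data <- data.frame(DATAGene = c("TSPAN6", "TNMD", "DPM1"),
                   V2 = c(294, 0, 354),
                   stringsAsFactors = FALSE)
add <- data.frame(ADDGene = c("TSPAN6", "TNMD", "TNMD", "SCYL3"),
                  V1 = c(51, 0, 7, 9),
                  stringsAsFactors = FALSE)

# Number of rows whose key already appeared earlier in the same data frame
n_dup_data <- sum(duplicated(data$DATAGene))
n_dup_add  <- sum(duplicated(add$ADDGene))

# Which keys are repeated in 'add'
dup_keys <- unique(add$ADDGene[duplicated(add$ADDGene)])

# Same left join as in the question: TNMD matches two rows of 'add',
# so the result has one more row than 'data'
Merged <- merge(data, add, by.x = "DATAGene", by.y = "ADDGene",
                all.x = TRUE, all.y = FALSE)
print(c(rows_data = nrow(data), rows_merged = nrow(Merged)))
```

Running the same duplicated() checks on the real data and add should show which side carries the repeated gene names.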
You are correct: the data and Merged data frames have the same number of rows if I only count the unique ones. Do you know if there is a way to keep only the rows from data and its duplicates?
I believe removing the duplicates from add fixed the issue. Thank you.
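The exact command is not shown in the thread. One way this could be done, as a sketch on made-up data (hypothetical values; add_unique is a name introduced here for illustration), is to keep only the first row for each gene in add before merging:

```r
# Toy stand-ins for 'data' and 'add' (hypothetical values for illustration)
data <- data.frame(DATAGene = c("TSPAN6", "TNMD", "DPM1"),
                   V2 = c(294, 0, 354),
                   stringsAsFactors = FALSE)
add <- data.frame(ADDGene = c("TSPAN6", "TNMD", "TNMD", "SCYL3"),
                  V1 = c(51, 0, 7, 9),
                  stringsAsFactors = FALSE)

# Keep the first occurrence of each gene in 'add'; later repeats are dropped
add_unique <- add[!duplicated(add$ADDGene), ]

# Left join: every row of 'data' is kept, and each can now match at most
# one row of 'add_unique'
Merged <- merge(data, add_unique, by.x = "DATAGene", by.y = "ADDGene",
                all.x = TRUE, all.y = FALSE)

# The merged result has exactly as many rows as 'data'
stopifnot(nrow(Merged) == nrow(data))
```

Note that this silently discards the V1 values of the dropped rows. If the repeated genes in add carry different values, it may be better to decide explicitly how to combine them (for example with aggregate()) before merging.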