Hi,
I have a list of genes (long_list) from which I want to select my genes of interest (goi) to obtain a filtered list of genes (genes_filtered). All genes in goi are expected to be contained in long_list.
However, when I filter for the genes in goi, I always retrieve four extra genes.
genes_filtered <- long_list[long_list$ensembl %in% goi$ensembl,]
Dimensions of the lists:
dim(long_list)[1]
> 15251
dim(goi)[1]
> 11221
dim(genes_filtered)[1]
> 11225 #Should be 11221 (same as goi)
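For reference, the number of rows returned by this kind of filter is the number of rows in long_list whose ID occurs in goi, so it can differ from the number of rows in goi in either direction. A minimal sketch with made-up IDs (toy data, not the real gene lists):

```r
# Toy data with made-up IDs (hypothetical, not the real gene lists)
long_list <- data.frame(ensembl = c("A", "B", "B", "B", "C"),
                        value   = 1:5)
goi <- data.frame(ensembl = c("A", "B", "E"))

# Same filter as in the post
genes_filtered <- long_list[long_list$ensembl %in% goi$ensembl, ]

nrow(goi)             # 3 IDs of interest
nrow(genes_filtered)  # 4 rows: "B" matches three rows, "E" matches none
```

In this toy case the result is larger than goi because one ID is repeated in long_list, even though another ID from goi is missing entirely.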
I have tried the following to get to the bottom of this.
Checking duplicates in genes_filtered, long_list, and goi:
dim(genes_filtered[duplicated(genes_filtered$ensembl),])[1]
> [1] 0
dim(long_list[duplicated(long_list$ensembl),])[1]
> [1] 0
dim(goi[duplicated(goi$ensembl),])[1]
> NULL
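The NULL above does not mean that goi has no duplicates. It would be consistent with goi being a data frame with a single column: in that case row subsetting with [rows, ] drops the result to a plain vector, and dim() of a vector is NULL. A small sketch on toy data (made-up IDs) that counts duplicates without this pitfall:

```r
# Toy single-column data frame (hypothetical IDs)
goi <- data.frame(ensembl = c("A", "B", "B"))

# With one column, [rows, ] drops the result to a plain vector, so dim() is NULL
dim(goi[duplicated(goi$ensembl), ])

# Keeping the data frame structure gives a usable row count
nrow(goi[duplicated(goi$ensembl), , drop = FALSE])  # 1

# Or count duplicated IDs directly on the column
sum(duplicated(goi$ensembl))                        # 1
```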
Checking missing values:
sum(is.na(genes_filtered))
> [1] 0
sum(is.na(long_list))
> [1] 0
sum(is.na(goi))
> [1] 0
Checking values contained in genes_filtered but not in goi:
# 1: Using lists
genes_filtered[!(genes_filtered$ensembl %in% goi$ensembl)]
data frame with 0 columns and 11125 rows
# 2: Extracting columns first from lists
f <- genes_filtered$ensembl
g <- goi$ensembl
g[!(g %in% f)]
[1] "ENSG00000283208" "ENSG00000284292" "ENSG00000262633" ...
Method 2 returns 96 genes in total, which is not expected.
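Note that method 1 has no comma inside the brackets, so the logical vector is used to select columns rather than rows, while method 2 checks the opposite direction (IDs in goi that are missing from genes_filtered). A sketch of both directions using setdiff(), on toy data with made-up IDs:

```r
# Toy data with made-up IDs (hypothetical, not the real gene lists)
long_list <- data.frame(ensembl = c("A", "B", "C", "D"), value = 1:4)
goi <- data.frame(ensembl = c("A", "B", "E"))
genes_filtered <- long_list[long_list$ensembl %in% goi$ensembl, ]

# IDs in goi that never made it into the filtered result
missing_from_filtered <- setdiff(goi$ensembl, genes_filtered$ensembl)

# IDs in the filtered result that are not in goi (expected to be empty)
extra_in_filtered <- setdiff(genes_filtered$ensembl, goi$ensembl)

# Row-wise version of check 1; the trailing comma selects rows, not columns
genes_filtered[!(genes_filtered$ensembl %in% goi$ensembl), ]
```

Here missing_from_filtered contains "E" and extra_in_filtered is empty, because every row kept by %in% has an ID that is in goi by construction.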
Can anyone explain why the filtering method at the top of the post does not work and possibly suggest a correct way?
Very strange. What version of R are you using? Can you post the two lists for others to compare?
I'm using R 4.1.2
About posting the lists: they are both quite long. What is the best way to post them?
https://pastebin.com/
Sorry, I cannot open that page.
try github then
Tried another network and managed to paste the data.
goi
https://pastebin.com/BGJk7j6i
long_list
https://pastebin.com/hX1nsnNG
Using the data you provided, this is what I get on my console:
I can't see any problem. Are you sure you did not make a copy-paste error?
Based on the data provided, dim(long_list)[1] returns 15202, which is not what was mentioned in the post. To figure out the issue, it is really important to work on the exact same datasets.
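One way to confirm that everyone is working on the same data is to compare a few summary numbers after loading the posted files. A minimal sketch, assuming the ID column is called ensembl as in the post; id_summary is a hypothetical helper name and the data below is a toy example:

```r
# Hypothetical helper: summarise an ID column so that numbers can be compared
id_summary <- function(df, col = "ensembl") {
  ids <- df[[col]]
  c(rows       = nrow(df),                          # total number of rows
    unique_ids = length(unique(ids[!is.na(ids)])),  # distinct non-missing IDs
    n_missing  = sum(is.na(ids)))                   # missing IDs
}

# Toy data with made-up IDs
toy <- data.frame(ensembl = c("A", "B", "B", NA))
id_summary(toy)
```

If rows and unique_ids differ between the original session and the posted data, the two sides are not looking at the same dataset.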