Filtering genes gives incorrect number of genes in filtered list
3.0 years ago
bioneer ▴ 40

Hi,

I have a list of genes (long_list) that I want to filter by a list of genes of interest (goi) to obtain a filtered list of genes (genes_filtered). All gois are expected to be contained in long_list.

However, when I filter for the gois, I always retrieve four extra genes.

genes_filtered <- long_list[long_list$ensembl %in% goi$ensembl,]

Dimensions of the lists:

dim(long_list)[1]
> 15251
dim(goi)[1]
> 11221
dim(genes_filtered)[1]
> 11225 #Should be 11221 (same as goi)

I have tried the following to get to the bottom of this.

Checking duplicates in genes_filtered, long_list, and goi:

dim(genes_filtered[duplicated(genes_filtered$ensembl),])[1]
> [1] 0
dim(long_list[duplicated(long_list$ensembl),])[1]
> [1] 0
dim(goi[duplicated(goi$ensembl),])[1]
> NULL
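
A side note on the NULL in the last check: it need not signal a problem with the data. One possible explanation, assuming goi is a data frame with a single column (the post does not say so), is that `[rows, ]` drops a one-column result to a plain vector, and dim() of a vector is NULL. A minimal sketch on made-up IDs (`goi_toy` is a hypothetical stand-in, not the real data):

```r
# Toy stand-in for goi; the single-column layout is an assumption.
goi_toy <- data.frame(ensembl = c("ENSG01", "ENSG02", "ENSG02"))

# With only one column, [rows, ] drops the result to a plain vector...
dup_rows <- goi_toy[duplicated(goi_toy$ensembl), ]
is.null(dim(dup_rows))   # TRUE: a vector has no dim attribute

# ...so count the duplicates directly instead of going through dim().
n_dup <- sum(duplicated(goi_toy$ensembl))

# drop = FALSE keeps the data frame shape if nrow()/dim() is still wanted.
n_dup_df <- nrow(goi_toy[duplicated(goi_toy$ensembl), , drop = FALSE])
```

Counting with sum(duplicated(...)) gives the same kind of answer whether the object has one column or many.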

Checking missing values:

sum(is.na(genes_filtered))
> [1] 0
sum(is.na(long_list))
> [1] 0
sum(is.na(goi))
> [1] 0

Checking values contained in genes_filtered but not in goi:

# 1: Using lists
genes_filtered[!(genes_filtered$ensembl %in% goi$ensembl)]
data frame with 0 columns and 11125 rows
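
A note on the "0 columns" result: without a comma inside the brackets, a data frame treats the logical vector as a column selector, so an all-FALSE vector keeps every row but no columns. The sketch below uses made-up data (`genes_toy` and `goi_ids` are hypothetical names) to contrast the two forms:

```r
# Toy data; names and values are invented for illustration.
genes_toy <- data.frame(ensembl = c("ENSG01", "ENSG02", "ENSG03"),
                        score   = c(1.5, 2.0, 0.3))
goi_ids <- c("ENSG01", "ENSG02", "ENSG03")

not_in_goi <- !(genes_toy$ensembl %in% goi_ids)  # all FALSE here

# No comma: the logical vector selects COLUMNS, so none are kept.
by_column <- genes_toy[not_in_goi]

# With a comma: the logical vector selects ROWS, which is what was intended.
by_row <- genes_toy[not_in_goi, ]
nrow(by_row)  # 0: every filtered gene is in goi_ids
```

With the comma in place, an empty result means that no gene in the filtered table falls outside the genes of interest.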

# 2: Extracting columns first from lists
f <- genes_filtered$ensembl
g <- goi$ensembl

g[!(g %in% f)]
[1] "ENSG00000283208" "ENSG00000284292" "ENSG00000262633" ...

Method 2 retrieves a list of in total 96 genes, which is not expected.
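
Note that `g[!(g %in% f)]` asks the opposite question from the heading above: it lists genes in goi that are absent from genes_filtered, not the other way round. A small sketch on made-up vectors shows how the two directions differ, here using base R's setdiff():

```r
# Toy vectors; the IDs are invented for illustration.
f <- c("ENSG01", "ENSG02")             # stands in for genes_filtered$ensembl
g <- c("ENSG01", "ENSG02", "ENSG03")   # stands in for goi$ensembl

# Genes in the filtered result that were never asked for (extras).
in_filtered_not_goi <- setdiff(f, g)

# Genes that were asked for but never found (missing).
in_goi_not_filtered <- setdiff(g, f)
```

In this toy case the first result is empty and the second contains "ENSG03", so the filtered list is short of a gene rather than carrying an extra one.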

Can anyone explain why the filtering method at the top of the post does not work and possibly suggest a correct way?


Very strange. What version of R are you using? Can you post the two lists for others to compare?


I'm using R 4.1.2

R version 4.1.2 (2021-11-01)

About posting the lists: They are both quite long, what is the best way to post them?


Sorry, I cannot open that page.


try github then


Tried another network and managed to paste the data.

goi

https://pastebin.com/BGJk7j6i

long_list

https://pastebin.com/hX1nsnNG


With the data you provided, I get the following on my console:

dim(genes_filtered)[1]
[1] 11125

I can't see any problem. Are you sure you did not make a copy-paste error?


Based on the data provided, dim(long_list)[1] returns 15202, which is not what is mentioned in the post. To figure out the issue, it is really important to work on the exact same datasets.

3.0 years ago
Mark ★ 1.6k

I can't believe you and I missed this. It took me a while to work out what the issue was:

length(ll) # 15202
length(goi) # 11221
sum(ll %in% goi) # 11125
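
The same bookkeeping can be turned into a quick check to run before filtering. The sketch below uses made-up vectors and a hypothetical helper, `count_missing`, and assumes neither vector contains duplicates:

```r
# Hypothetical helper: how many wanted IDs are absent from the pool?
count_missing <- function(wanted, pool) {
  sum(!(wanted %in% pool))
}

# Toy vectors; the IDs are invented for illustration.
ll_toy  <- c("ENSG01", "ENSG02", "ENSG03", "ENSG04")
goi_toy <- c("ENSG02", "ENSG04", "ENSG99")

n_kept    <- sum(ll_toy %in% goi_toy)         # rows the filter would keep
n_missing <- count_missing(goi_toy, ll_toy)   # wanted IDs not in the pool

# Without duplicates, kept + missing always adds up to length(goi_toy).
balanced <- length(goi_toy) == n_kept + n_missing
```

If n_missing is not zero, the filtered result is necessarily shorter than the list of genes of interest.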

In case you missed it:

11221
11125

The difference? 96. You said:

Method 2 retrieves a list of in total 96 genes, which is not expected.

Not sure why we discounted this, but method 2 is correct: 96 of the genes in goi are not present in long_list, so the filter can return at most 11221 - 96 = 11125 rows. Here's another method that shows this:

library(dplyr)
goi_df = data.frame(id = goi,
                    goi_id = 1:length(goi))
ll_df = data.frame(id = ll,
                    ll_id = 1:length(ll))
combine = left_join(ll_df, goi_df, by = "id")

sum(is.na(combine$goi_id)) # 4077

15202 - 4077 # 11125
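
For anyone without dplyr installed, roughly the same count can be had in base R with match(), which returns NA for every element of its first argument that has no partner in the second. A sketch on made-up vectors (not the real lists), assuming goi holds no duplicate IDs:

```r
# Toy vectors; the IDs are invented for illustration.
ll_toy  <- c("ENSG01", "ENSG02", "ENSG03", "ENSG04")
goi_toy <- c("ENSG02", "ENSG04", "ENSG99")

# Position of each long-list ID within goi, NA where it is not a gene of interest.
goi_pos <- match(ll_toy, goi_toy)

n_unmatched <- sum(is.na(goi_pos))           # long-list genes not in goi
n_matched   <- length(ll_toy) - n_unmatched  # long-list genes that are in goi
```

Here n_unmatched plays the role of the 4077 above and n_matched the role of the 11125.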

So which genes of interest are missing?

setdiff(goi, combine$id)

 [1] "ENSG00000283208" "ENSG00000284292" "ENSG00000262633" "ENSG00000258472" "ENSG00000276017" "ENSG00000283809" "ENSG00000277971" "ENSG00000277639"
 [9] "ENSG00000269755" "ENSG00000268400" "ENSG00000258311" "ENSG00000261341" "ENSG00000268750" "ENSG00000279765" "ENSG00000226690" "ENSG00000284526"
[17] "ENSG00000267740" "ENSG00000261884" "ENSG00000267426" "ENSG00000170846" "ENSG00000283580" "ENSG00000267314" "ENSG00000283782" "ENSG00000283761"
[25] "ENSG00000214265" "ENSG00000183889" "ENSG00000262165" "ENSG00000275063" "ENSG00000277856" "ENSG00000182584" "ENSG00000267120" "ENSG00000259529"
[33] "ENSG00000258465" "ENSG00000203546" "ENSG00000282246" "ENSG00000187186" "ENSG00000278384" "ENSG00000124593" "ENSG00000273748" "ENSG00000277263"
[41] "ENSG00000285085" "ENSG00000256591" "ENSG00000173915" "ENSG00000156411" "ENSG00000214654" "ENSG00000284976" "ENSG00000151131" "ENSG00000130921"
[49] "ENSG00000174206" "ENSG00000125149" "ENSG00000088854" "ENSG00000285217" "ENSG00000171159" "ENSG00000104957" "ENSG00000198003" "ENSG00000160124"
[57] "ENSG00000018610" "ENSG00000187866" "ENSG00000156504" "ENSG00000156500" "ENSG00000051009" "ENSG00000151553" "ENSG00000158863" "ENSG00000285382"
[65] "ENSG00000275464" "ENSG00000287542" "ENSG00000232593" "ENSG00000162929" "ENSG00000276033" "ENSG00000274847" "ENSG00000072415" "ENSG00000105926"
[73] "ENSG00000198899" "ENSG00000228253" "ENSG00000198804" "ENSG00000198712" "ENSG00000198938" "ENSG00000198727" "ENSG00000198888" "ENSG00000198763"
[81] "ENSG00000198840" "ENSG00000198886" "ENSG00000212907" "ENSG00000198786" "ENSG00000198695" "ENSG00000258724" "ENSG00000228049" "ENSG00000285437"
[89] "ENSG00000143303" "ENSG00000205045" "ENSG00000102125" "ENSG00000285053" "ENSG00000283268" "ENSG00000011638" "ENSG00000243667" "ENSG00000225528"

Tricky tricky. Hope this helps.


Hi Mark, Thank you so much for digging into the issue and for finding what I missed! The assumption that all genes of interest would be contained in the long list was apparently wrong.

Again, thank you!
