How to filter gene counts by keeping genes that have a count of 10 or more in at least three samples
16 months ago
BioinfoBee • 0

Hello all, is anyone aware of a way in R to filter gene counts so that only genes with a count of 10 or more in at least three samples (or replicates) are kept? I am using the rowSums function, which sums all the gene counts in a row, but this doesn't ensure that the total comes from a minimum of three samples. Kindly suggest!

Regards, B

filtering counts gene RNA-Seq • 1.7k views
16 months ago

Example data:

df <- data.frame(
  sample1 = c(10, 12, 9, 8, 13, 2, 1),
  sample2 = c(1, 1, 9, 8, 10, 20, 5),
  sample3 = c(1, 11, 9, 18, 11, 2, 2),
  row.names = c("gene_1", "gene_2", "gene_3", "gene_4", "gene_5", "gene_6", "gene_7")
)

Use the apply function to create a logical vector that can be used to filter the initial dataset:

keep <- apply(df, 1, function(row) sum(row >= 10) >= 3)
keep
gene_1 gene_2 gene_3 gene_4 gene_5 gene_6 gene_7 
 FALSE  FALSE  FALSE  FALSE   TRUE  FALSE  FALSE 

You can then use keep to filter the expression matrix, retaining only genes with a count >= 10 in >= 3 samples:

filtered_df <- df[keep, ]
16 months ago
bkleiboeker ▴ 370
filtered_counts <- counts_matrix[rowSums(counts_matrix >= 10) >= 3,]
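
For anyone wondering why this one-liner works: comparing the matrix with 10 gives a logical matrix, and rowSums on a logical matrix counts the TRUE values in each row, i.e. the number of samples that pass the threshold. Below is a minimal sketch that spells out the steps. It reuses the toy data from the answer above; the name counts_matrix and the intermediate variable names are only assumptions for illustration.

```r
# Toy counts copied from the earlier example; `counts_matrix` is an assumed name
counts_matrix <- as.matrix(data.frame(
  sample1 = c(10, 12, 9, 8, 13, 2, 1),
  sample2 = c(1, 1, 9, 8, 10, 20, 5),
  sample3 = c(1, 11, 9, 18, 11, 2, 2),
  row.names = paste0("gene_", 1:7)
))

# Step 1: TRUE wherever a single count is 10 or more
passes <- counts_matrix >= 10

# Step 2: TRUE counts as 1, so this is the number of passing samples per gene
n_passing <- rowSums(passes)

# Step 3: keep genes that pass in at least three samples
keep <- n_passing >= 3

# drop = FALSE keeps the result a matrix even if only one gene survives
filtered_counts <- counts_matrix[keep, , drop = FALSE]

# The apply-based approach from the other answer should select the same genes
keep_apply <- apply(counts_matrix, 1, function(row) sum(row >= 10) >= 3)
stopifnot(all(keep == keep_apply))
```

On large count matrices the rowSums version is usually faster than apply because it avoids calling an R function once per row.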

bkleiboeker Thank you. This filters rows with a minimum of >= 10 counts in >= 3 samples. Is there a way to filter by the three replicates within each sample group instead? For example, keeping only rows with at least 10 counts in each of the three replicates of a sample group?

Hamid Ghaedi Thank you.


I'm not 100% sure I understand your question, but would something like this work? This should keep rows which have at least 10 counts (>= 10) in at least 3 replicates of at least one condition.

df <- data.frame(
  control1 = c(10, 12, 9, 8, 13, 2, 1),
  control2 = c(1, 1, 9, 8, 10, 20, 5),
  control3 = c(1, 11, 9, 18, 11, 2, 2),
  control4 = c(10, 12, 9, 8, 13, 2, 1),
  control5 = c(1, 1, 9, 8, 10, 20, 5),
  control6 = c(1, 11, 9, 18, 11, 2, 2),
  treatment1 = c(10, 12, 9, 8, 13, 2, 1),
  treatment2 = c(1, 1, 9, 8, 10, 20, 5),
  treatment3 = c(1, 11, 9, 18, 11, 2, 2),
  treatment4 = c(1, 1, 9, 8, 10, 20, 5),
  treatment5 = c(1, 11, 9, 18, 11, 2, 2),
  treatment6 = c(1, 11, 9, 18, 11, 2, 2),
  row.names = c("gene_1", "gene_2", "gene_3", "gene_4", "gene_5", "gene_6", "gene_7")
)

keep <- (rowSums(df[, grepl("control", colnames(df))] >= 10) >= 3) |
  (rowSums(df[, grepl("treatment", colnames(df))] >= 10) >= 3)

filtered_df <- df[keep,]
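
If there are more than two conditions, the same idea can be written with a grouping vector instead of one grepl call per condition. This is only a sketch under some assumptions: the toy matrix, the variable names (counts, group, min_count, min_reps) and the way the group labels are derived from the column names are all made up for illustration. It shows both readings of the question: passing in at least one condition, or passing in every condition.

```r
# Hypothetical toy data: 2 conditions x 3 replicates, 4 genes
counts <- matrix(
  c(12, 15, 11,  0,  1,   2,   # gene_a: passes in control only
     3,  2,  1, 20, 25,  30,   # gene_b: passes in treatment only
    10,  9, 10, 10, 10,   9,   # gene_c: only 2 of 3 replicates in each group
    50, 60, 70, 80, 90, 100),  # gene_d: passes in both groups
  nrow = 4, byrow = TRUE,
  dimnames = list(
    paste0("gene_", c("a", "b", "c", "d")),
    c("control1", "control2", "control3",
      "treatment1", "treatment2", "treatment3")
  )
)

# Assumption: the condition is the column name without its trailing digits
group <- factor(sub("[0-9]+$", "", colnames(counts)))

min_count <- 10  # minimum count per replicate
min_reps  <- 3   # minimum number of passing replicates within a condition

# genes x conditions matrix: number of passing replicates per condition
n_pass_by_group <- sapply(levels(group), function(g) {
  rowSums(counts[, group == g, drop = FALSE] >= min_count)
})

# Condition-level pass/fail for each gene
group_ok <- n_pass_by_group >= min_reps

keep_any <- rowSums(group_ok) >= 1               # passes in at least one condition
keep_all <- rowSums(group_ok) == nlevels(group)  # passes in every condition

filtered_any <- counts[keep_any, , drop = FALSE]
filtered_all <- counts[keep_all, , drop = FALSE]
```

With this toy data, keep_any retains gene_a, gene_b and gene_d, while keep_all retains only gene_d. Setting min_reps to the number of replicates per condition corresponds to requiring the threshold in each replicate.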

This worked. Thanks!
