Question

How to filter gene counts by keeping genes that have a count of 10 or more in at least three samples

22 months ago

BioinfoBee • 0

Hello all, I am curious whether anyone knows of a way in R to filter gene counts so that only genes with a count of 10 or more in at least three samples (or replicates) are kept. I am currently using the rowSums function, which takes the sum of all counts in a given row, but this does not ensure that the total comes from at least three samples. Kindly suggest!

Regards, B

filtering counts gene RNA-Seq • 2.2k views

score 1 · Answer 1 · 2023-07-18

An example data:

df <- data.frame(
  sample1 = c(10, 12, 9, 8, 13, 2, 1),
  sample2 = c(1, 1, 9, 8, 10, 20, 5),
  sample3 = c(1, 11, 9, 18, 11, 2, 2),
  row.names = c("gene_1", "gene_2", "gene_3", "gene_4", "gene_5", "gene_6", "gene_7")
)

Use the apply function to create a logical vector that can be used to filter the initial dataset:

keep <- apply(df, 1, function(row) sum(row >= 10) >= 3)
keep
gene_1 gene_2 gene_3 gene_4 gene_5 gene_6 gene_7 
 FALSE  FALSE  FALSE  FALSE   TRUE  FALSE  FALSE

You can then use keep to filter the expression matrix, retaining only genes with an expression value >= 10 in >= 3 samples:

filtered_df <- df[keep, ]
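
If the thresholds are likely to change, the same apply() logic could be wrapped in a small helper. This is only a sketch: the function name filter_min_count and its arguments min_count and min_samples are made up for illustration and do not come from any package. The example data is repeated so the snippet runs on its own.

```r
# Hypothetical helper (name and arguments are illustrative assumptions):
# keep genes with at least `min_count` reads in at least `min_samples` samples
filter_min_count <- function(counts, min_count = 10, min_samples = 3) {
  keep <- apply(counts, 1, function(row) sum(row >= min_count) >= min_samples)
  # drop = FALSE keeps the result two-dimensional even if one gene survives
  counts[keep, , drop = FALSE]
}

# Same example data as above
df <- data.frame(
  sample1 = c(10, 12, 9, 8, 13, 2, 1),
  sample2 = c(1, 1, 9, 8, 10, 20, 5),
  sample3 = c(1, 11, 9, 18, 11, 2, 2),
  row.names = paste0("gene_", 1:7)
)

filtered_df <- filter_min_count(df)                     # >= 10 in >= 3 samples
relaxed_df  <- filter_min_count(df, min_samples = 2)    # >= 10 in >= 2 samples
```

With the defaults only gene_5 is kept; relaxing min_samples to 2 also keeps gene_2.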

score 1 · Answer 2 · 2023-07-18

22 months ago

bkleiboeker ▴ 370

filtered_counts <- counts_matrix[rowSums(counts_matrix >= 10) >= 3,]
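
To illustrate why this differs from summing the raw counts, here is a small sketch. The toy matrix, its gene and sample names, and the total-count cutoff of 30 are assumptions made up for the example. Comparing the matrix against 10 first gives a logical matrix, so rowSums then counts samples rather than reads.

```r
# Toy count matrix (made-up values for illustration)
counts_matrix <- matrix(
  c(100,  0,  0,
     10, 10, 10,
      9,  9,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("gene_a", "gene_b", "gene_c"), c("s1", "s2", "s3"))
)

# Summing raw counts: gene_a passes a total cutoff thanks to one sample alone
by_total <- rowSums(counts_matrix) >= 30

# Summing the logical matrix: number of samples with >= 10 reads per gene
n_passing <- rowSums(counts_matrix >= 10)
by_samples <- n_passing >= 3

# drop = FALSE keeps a matrix even when a single row remains
filtered_counts <- counts_matrix[by_samples, , drop = FALSE]
```

Here gene_a passes the total-count filter but has only one sample at or above 10, so only gene_b survives the per-sample filter.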


bkleiboeker Thank you. This keeps rows with counts of >= 10 in >= 3 samples. Is there a way to filter by three replicates per sample instead? For example, keeping only rows with a minimum of >10 counts in each of the three replicates of a sample?

Hamid Ghaedi Thank you.

22 months ago by BioinfoBee • 0

I'm not 100% sure I understand your question, but would something like this work? It should keep rows that have >= 10 counts in at least 3 replicates of at least one condition.

df <- data.frame(
  control1 = c(10, 12, 9, 8, 13, 2, 1),
  control2 = c(1, 1, 9, 8, 10, 20, 5),
  control3 = c(1, 11, 9, 18, 11, 2, 2),
  control4 = c(10, 12, 9, 8, 13, 2, 1),
  control5 = c(1, 1, 9, 8, 10, 20, 5),
  control6 = c(1, 11, 9, 18, 11, 2, 2),
  treatment1 = c(10, 12, 9, 8, 13, 2, 1),
  treatment2 = c(1, 1, 9, 8, 10, 20, 5),
  treatment3 = c(1, 11, 9, 18, 11, 2, 2),
  treatment4 = c(1, 1, 9, 8, 10, 20, 5),
  treatment5 = c(1, 11, 9, 18, 11, 2, 2),
  treatment6 = c(1, 11, 9, 18, 11, 2, 2),
  row.names = c("gene_1", "gene_2", "gene_3", "gene_4", "gene_5", "gene_6", "gene_7")
)

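# Columns belonging to each condition, matched by name
is_control <- grepl("control", colnames(df))
is_treatment <- grepl("treatment", colnames(df))

# Keep a gene if either condition has >= 3 replicates with >= 10 counts
keep <- (rowSums(df[, is_control] >= 10) >= 3) |
  (rowSums(df[, is_treatment] >= 10) >= 3)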

filtered_df <- df[keep,]
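
With more than two conditions, repeating the grepl() call per condition gets unwieldy. One possible generalisation is sketched below; it assumes that column names are a condition label followed by a replicate number, and the toy data and the variable names group, passing, keep_any and keep_all are made up for illustration. It uses >= 10, as in the answers above.

```r
# Toy data: 3 replicates per condition (made-up values)
df <- data.frame(
  control1   = c(10, 12, 9, 15),
  control2   = c(1, 11, 9, 15),
  control3   = c(1, 11, 9, 15),
  treatment1 = c(10, 12, 9, 15),
  treatment2 = c(12, 1, 9, 15),
  treatment3 = c(11, 1, 9, 15),
  row.names = paste0("gene_", 1:4)
)

# Condition label per column: strip the trailing replicate number
group <- sub("[0-9]+$", "", colnames(df))

# Genes x conditions matrix: replicates with >= 10 reads in each condition
passing <- sapply(split(colnames(df), group), function(cols) {
  rowSums(df[, cols, drop = FALSE] >= 10)
})

min_reps <- 3
# At least `min_reps` passing replicates in at least one condition
keep_any <- rowSums(passing >= min_reps) >= 1
# At least `min_reps` passing replicates in every condition
keep_all <- rowSums(passing >= min_reps) == ncol(passing)

filtered_any <- df[keep_any, ]
filtered_all <- df[keep_all, ]
```

keep_any corresponds to the "at least one condition" rule used above, while keep_all is the stricter reading in which every condition must have three passing replicates. In this toy data gene_1, gene_2 and gene_4 pass the first rule, and only gene_4 passes the second.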