filtering dge data with cpm
1
0
Entering edit mode
2.6 years ago

Hi I have little bit R problem when I am trying to filtering my DGE data taken from edgeR. My trial data is like below:

     SRR13826419_counts SRR13826420_counts SRR13826421_counts SRR13826422_counts SRR13826423_counts
ENSG00000000003.15   55.4289230485435   48.6042625938295   50.0892015765418   39.9978650657232    45.745056983312
ENSG00000000005.6                   0                  0                  0                  0                  0
ENSG00000000419.14   36.8048049042329   36.6032101015259    35.981799866693   32.9242876425245   39.1630343957851
ENSG00000000457.14     6.799281227288   4.50039468461384   5.54785460499672   6.17330393297335     6.033520705233
ENSG00000000460.17   19.6587913745501   25.8022628584527   20.6063171042735   18.7771327961273   21.0624722800861
ENSG00000000938.13                  0                  0                  0                  0                  0
                   SRR13826424_counts
ENSG00000000003.15   38.0779604481818
ENSG00000000005.6                   0
ENSG00000000419.14   42.1830193812572
ENSG00000000457.14    5.5205964962048
ENSG00000000460.17   14.7215906565461
ENSG00000000938.13                  0

I would like to filter out genes with expression greater than 1 (in CPM) in more than 50% of samples in at least 1 sample group.

My samples are in 2 conditions and they are triplicate.

CPM_matrix <- read.delim("~/Desktop/CPM_matrix.txt", row.names = "gene_id")
labels <- names(CPM_matrix)
CPM_matrix[labels] <- lapply(CPM_matrix[labels] , factor)

CPM_try <- head(CPM_matrix)
group <- factor(c(1,1,1,2,2,2))
countMatrix_groups <- data.frame(label=colnames(CPM_matrix),group)

aa <- noquote(countMatrix_groups[countMatrix_groups$group == '1',]$label[1])
bb <- noquote(countMatrix_groups[countMatrix_groups$group == '1',]$label[2])
cc <- noquote(countMatrix_groups[countMatrix_groups$group == '1',]$label[3])
dd <- noquote(countMatrix_groups[countMatrix_groups$group == '2',]$label[1])
ee <- noquote(countMatrix_groups[countMatrix_groups$group == '2',]$label[2])
ff <- noquote(countMatrix_groups[countMatrix_groups$group == '2',]$label[3])


for (i in 1:nrow(CPM_try)) {

aaa <-  filter(CPM_try, aa < 1 & bb < 1 | aa < 1 & cc < 1 | bb < 1 & cc < 1 | dd < 1 & ee < 1 | dd < 1 & ff < 1 | ee < 1 & ff < 1 )
}

When I apply this code it gives me nothing, there is sth wrong but I couldnt figure out thanks in advance.

expression r gene rna-seq differential • 729 views
ADD COMMENT
1
Entering edit mode
2.6 years ago
jv ★ 1.8k

In my experience it's not worth performing such a specific filtering step, as opposed to a more general filter to remove consistently low or non expressed gene before doing differential expression analysis - that's of course assuming that that is your next step here. For some context see: https://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#pre-filtering

To do the filtering you've described here I suggest a tidyverse/dplyr approach in which you convert your counts matrix to long-format and use group_by to calculate sample group metrics using summarise for which the following resources will be helpful:

https://tidyr.tidyverse.org/reference/pivot_longer.html
https://dplyr.tidyverse.org/reference/summarise.html#ref-examples

ADD COMMENT
1
Entering edit mode

Another very common and automated filter prior to the DE analysis would be filterByExpr from edgeR.

ADD REPLY

Login before adding your answer.

Traffic: 2640 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6