Question

error in duplicate identification

21 months ago

Mamatha Y S • 0

# duplicated genes and number of duplicates
duplicated_genes <- names(table(df$hgnc_symbol)[table(df$hgnc_symbol) > 1])
gene_counts <- table(df$hgnc_symbol)[duplicated_genes]

# number of zero-expression rows for each duplicated gene
zero_counts <- sapply(unique(duplicated_genes), function(gene) {
  sum(rowSums(df[df$hgnc_symbol == gene, -ncol(df)]) == 0)
})

This is the code I'm running. I want to identify the duplicated genes in my data frame and how often each one occurs. In a third column, I want to know, for each duplicated gene, how many of its rows have a row sum of zero (gene expression of zero in all samples). For example, if a gene is duplicated 7 times, how many of those 7 rows sum to zero?

The first two lines give the correct result, but for the zero-expression counts I get NA for every gene, and I don't understand why. Please help me with this.

r RNA-seq • 385 views

updated 21 months ago by Ram 44k • written 21 months ago by Mamatha Y S • 0

Is the hgnc_symbol the last column in your df? Is that why you're using -ncol(df) for the rowSums function?

You're getting NA because some values in your df are NA. You could pass na.rm = TRUE to the sum function, as long as you understand what it's doing. You're expecting 0 and there are also NAs in there, which indicates either a gap in your expectations or a difference in what 0 and NA mean in your data.
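
To make the NA behaviour concrete, here is a minimal sketch on made-up data. The toy df, its column names (s1, s2, s3), and the helper names (expr, n_na_symbol, n_na_expr, zero_counts_orig, result) are all assumptions for illustration, not the asker's real data. It assumes hgnc_symbol is the last column, and that NAs may sit in the symbol column (for example, IDs that failed to map) as well as in the expression values. An NA in hgnc_symbol alone is enough to turn every gene's count into NA, because df$hgnc_symbol == gene is NA for those rows and indexing with NA returns an all-NA row.

```r
# Toy data (hypothetical): three sample columns, hgnc_symbol last,
# one unmapped gene (NA symbol) and one missing expression value.
df <- data.frame(
  s1 = c(0, 5, 0, 0, 2, NA),
  s2 = c(0, 1, 0, 0, 0, 0),
  s3 = c(0, 0, 0, 0, 1, 0),
  hgnc_symbol = c("A", "A", "B", "B", NA, "B"),
  stringsAsFactors = FALSE
)

# Check where the NAs are: symbol column vs. expression columns.
expr <- df[, -ncol(df), drop = FALSE]
n_na_symbol <- sum(is.na(df$hgnc_symbol))
n_na_expr <- sum(is.na(expr))

# table() drops NA symbols by default, so the counts look fine.
gene_counts <- table(df$hgnc_symbol)
duplicated_genes <- names(gene_counts)[gene_counts > 1]

# Original approach: the logical index contains NA for the NA symbol,
# which pulls an all-NA row into every subset, so each sum() is NA.
zero_counts_orig <- sapply(duplicated_genes, function(gene) {
  sum(rowSums(df[df$hgnc_symbol == gene, -ncol(df)]) == 0)
})

# which() keeps only the TRUE matches; na.rm = TRUE in sum() then skips
# rows whose row sum is NA because an expression value is missing.
zero_counts <- sapply(duplicated_genes, function(gene) {
  rows <- which(df$hgnc_symbol == gene)
  sum(rowSums(expr[rows, , drop = FALSE]) == 0, na.rm = TRUE)
})

result <- data.frame(
  gene = duplicated_genes,
  n_copies = as.vector(gene_counts[duplicated_genes]),
  n_zero = unname(zero_counts)
)
print(result)
```

On this toy data, zero_counts_orig is NA for both genes, while zero_counts is 1 for A and 2 for B. Note that with na.rm = TRUE a row containing a missing expression value is not counted as zero expression, which may or may not be what you want; checking n_na_symbol and n_na_expr first shows which kind of NA you are actually dealing with.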

21 months ago by Ram 44k