Question

Averaging duplicates of a gene in RNA-Seq dataset

5

3.4 years ago

mohammedtoufiq91 ▴ 260

Hi,

I am working with an RNA-Seq dataset and have the raw counts file. There are 58785 genes in the "Gene Symbol" column, and some gene symbols appear twice (shown below). What is the best practice for handling such genes? Should the duplicated rows simply be averaged, or summed, before they are used in downstream analysis?

dput(head(Counts, 5))
structure(list(symbol = c("BM", "A2GGG", "A2GGG", "P1P", 
"P1P"), Sample_A = c(0L, 0L, 82L, 46L, 6L), Sample_B = c(1L, 
0L, 64L, 49L, 5L), Sample_C = c(2L, 0L, 96L, 44L, 6L), Sample_D = c(5L, 
0L, 85L, 38L, 3L), Sample_E = c(1L, 0L, 80L, 48L, 6L), Sample_F = c(1L, 
0L, 77L, 49L, 4L)), row.names = c(NA, 5L), class = "data.frame")

Average

(A2GGG + A2GGG)/2 = A2GGG

Sum

A2GGG + A2GGG = A2GGG
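
To make the two options concrete, here is a minimal sketch that applies both to the five example rows from the `dput()` output above. The helper names `samples`, `avg` and `tot` are introduced only for illustration, and this is not meant as a recommendation of either approach.

```r
# Example rows copied from the dput() output above
Counts <- data.frame(
  symbol   = c("BM", "A2GGG", "A2GGG", "P1P", "P1P"),
  Sample_A = c(0L, 0L, 82L, 46L, 6L),
  Sample_B = c(1L, 0L, 64L, 49L, 5L),
  Sample_C = c(2L, 0L, 96L, 44L, 6L),
  Sample_D = c(5L, 0L, 85L, 38L, 3L),
  Sample_E = c(1L, 0L, 80L, 48L, 6L),
  Sample_F = c(1L, 0L, 77L, 49L, 4L),
  stringsAsFactors = FALSE
)

# All columns except the gene symbol hold counts
samples <- setdiff(names(Counts), "symbol")

# Average: one row per symbol, mean over the duplicated rows
avg <- aggregate(Counts[samples], by = list(symbol = Counts$symbol), FUN = mean)

# Sum: one row per symbol, total over the duplicated rows
tot <- aggregate(Counts[samples], by = list(symbol = Counts$symbol), FUN = sum)

print(avg)
print(tot)
```

With these rows, the averaged A2GGG value in Sample_F is 38.5, so averaging can turn integer counts into non-integer values, whereas the sum (77) stays an integer.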

Thank you,

Toufiq

expression differential average R rna-seq • 3.3k views

ADD COMMENT • link updated 3.4 years ago by biomon ▴ 60 • written 3.4 years ago by mohammedtoufiq91 ▴ 260

1

How did you generate the counts and how was the raw data processed?

ADD REPLY • link 3.4 years ago by biomon ▴ 60

0

And what genome version did you use?

ADD REPLY • link 3.4 years ago by biomon ▴ 60

0

I would prefer to take the median instead of the mean.

ADD REPLY • link 3.4 years ago by Nitin Narwade ★ 1.6k

0

The median and the mean are the same when there are only two values.
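
A quick check of this, using the two A2GGG values from Sample_A in the question (the third value in `three` is made up purely to show where the two statistics start to differ):

```r
# Two values: median and mean coincide
pair <- c(0, 82)
pair_mean   <- mean(pair)
pair_median <- median(pair)

# Three values (the 85 is an invented extra duplicate): they can differ
three <- c(0, 82, 85)
three_mean   <- mean(three)
three_median <- median(three)

print(c(pair_mean, pair_median))
print(c(three_mean, three_median))
```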

ADD REPLY • link 3.4 years ago by ATpoint 85k

0

ATpoint, thank you very much.

For past analyses, I have used the mean, as follows (the symbol column is left out of the values being averaged, since it is not numeric):

Counts = aggregate(Counts[, names(Counts) != "symbol"], by = list(symbol = Counts$symbol), FUN = mean)

So, I understand that it is OK to use either the mean or the median, right? Any input on using the sum to aggregate the counts?

Are there specific scenarios in which the mean, median or sum should be used?

ADD REPLY • link 3.4 years ago by mohammedtoufiq91 ▴ 260

3

I am not sure whether this makes sense. I know that duplicated gene names are a pain, but these genes have unique Ensembl Gene IDs and come from different genomic coordinates, so averaging is suboptimal. Why not use something like EnsemblGeneID_GeneName as the identifier, i.e. a concatenation of the Ensembl ID and the gene name? Then you can simply keep all genes. Or make the names unique, like Gene1 and Gene1a, and only care about them if they end up being differential. If not, simply forget about them. Just thinking aloud.
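
A rough sketch of both naming ideas, assuming the annotation has an Ensembl ID column. The `ensembl_id` values below are invented placeholders, not real identifiers, and the column names are hypothetical. Note that base R `make.unique()` appends numeric suffixes (`.1`, `.2`, ...) rather than letters.

```r
# Hypothetical gene table: the Ensembl IDs are invented placeholders
genes <- data.frame(
  ensembl_id = c("ENSG_X1", "ENSG_X2", "ENSG_X3", "ENSG_X4", "ENSG_X5"),
  symbol     = c("BM", "A2GGG", "A2GGG", "P1P", "P1P"),
  stringsAsFactors = FALSE
)

# Option 1: concatenate Ensembl ID and gene name into one identifier
genes$id_concat <- paste(genes$ensembl_id, genes$symbol, sep = "_")

# Option 2: keep the symbol and add a suffix to repeated occurrences
genes$id_unique <- make.unique(genes$symbol)

print(genes)
```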

ADD REPLY • link 3.4 years ago by ATpoint 85k

0

ATpoint, thank you.

The reason I am trying to collapse the data into a single value per gene is that I will be mapping/merging this gene-level matrix to a third-party gene annotation database. If I use make.unique(), the genes renamed with a suffix (for instance Gene1, Gene1a, ...) will be lost during the mapping, since the third-party database contains only the plain gene name (Gene1, but not Gene1a or Gene1b, etc.). So it is important for me to include an averaged gene value when mapping.
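
The mapping problem can be illustrated with a small sketch. The `annot` table and its `pathway` column are hypothetical stand-ins for the third-party database, and only one sample column is used to keep it short. The sum is used for collapsing here purely as an example; the mean would work the same way.

```r
# One sample column from the example data in the question
counts <- data.frame(
  symbol   = c("BM", "A2GGG", "A2GGG", "P1P", "P1P"),
  Sample_A = c(0, 0, 82, 46, 6),
  stringsAsFactors = FALSE
)

# Hypothetical third-party annotation keyed by plain gene symbol
annot <- data.frame(
  symbol  = c("BM", "A2GGG", "P1P"),
  pathway = c("p1", "p2", "p3"),
  stringsAsFactors = FALSE
)

# Renaming duplicates: suffixed rows find no match and are dropped by merge()
renamed <- counts
renamed$symbol <- make.unique(renamed$symbol)
merged_renamed <- merge(renamed, annot, by = "symbol")

# Collapsing duplicates first: every symbol still matches the annotation
collapsed <- aggregate(Sample_A ~ symbol, data = counts, FUN = sum)
merged_collapsed <- merge(collapsed, annot, by = "symbol")

print(merged_renamed)
print(merged_collapsed)
```

In the renamed version only the first occurrence of each symbol survives the merge, so the A2GGG row with 82 counts is lost, while the collapsed version keeps it.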

ADD REPLY • link 3.4 years ago by mohammedtoufiq91 ▴ 260