Question

Different Ensembl Ids point to the same gene symbol.

6

5.4 years ago

elefth.pavlos ▴ 60

Hi all,

I have a matrix of RNA-seq read counts, with Ensembl gene IDs in rows and samples in columns. I want to test for differential expression between 2 groups of samples.

I retrieved gene symbols using biomaRt and noticed that different Ensembl IDs returned the same gene symbol. I was thinking of merging the entries for those Ensembl IDs and summing their read counts. Is that a sound approach?

In addition, some IDs (e.g. ENSG00000069712) appear to be retired on the current Ensembl website (GRCh38.p12), while on the Ensembl archive (GRCh38.p7) I get an associated gene.

Current Ensembl website

Archived Ensembl

Thanks!

RNA-Seq • 14k views

ADD COMMENT • link updated 2.3 years ago by ajay nair ▴ 50 • written 5.4 years ago by elefth.pavlos ▴ 60

0

This seems to be exactly the follow-up to "Why am I getting different ensembl gene ids for a given gene symbol?", for which Emily_Ensembl suggested starting a new post. I came here with the same question: is it okay to sum the raw counts/TPM of different Ensembl IDs that share a gene name?

ADD REPLY • link 5.0 years ago by ajay nair ▴ 50

score 3 · Answer 1 · 2020-06-16

Merging different Ensembl IDs by summing their read counts does not seem to be a sound approach. A nice explanation of why multiple ID mappings occur is given here. I have run into this issue, multiple Ensembl IDs mapping to the same gene name, many times. I am not aware of a general solution, but here is the approach I use.

Remove low-count genes from the raw data matrix before normalization (most of the retired Ensembl gene IDs had zero raw counts in my datasets).

Normalize. In the normalized gene expression matrix, convert the Ensembl gene IDs to gene symbols using biomaRt (any remaining retired IDs are removed here).

library("biomaRt")  # example code for mouse gene ID mapping
ensembl <- useMart("ensembl", dataset = "mmusculus_gene_ensembl")
# Look up the MGI symbol for each Ensembl gene ID in the normalized matrix
genemap <- getBM(attributes = c("ensembl_gene_id", "mgi_symbol"),
                 filters = "ensembl_gene_id",
                 values = rownames(my_normalizedMatrix), mart = ensembl)
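The low-count filtering described above could be sketched as follows. This is a minimal illustration in base R, not necessarily what was run on the original data: the toy matrix, its placeholder gene IDs, and the `min_total` cutoff are all assumptions made up for the example.

```r
# Toy raw count matrix: gene IDs in rows, samples in columns.
# The IDs and counts are invented for illustration only.
raw_counts <- matrix(
  c(  0,  0,   0,   0,
     15, 22,   9,  30,
      1,  0,   2,   0,
    120, 98, 143, 110),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C", "ENSMUSG_D"),
    c("s1", "s2", "s3", "s4")
  )
)

# Hypothetical cutoff on the total count per gene; pick one that suits your data
min_total <- 10

# Keep only genes whose summed count across samples reaches the cutoff
keep <- rowSums(raw_counts) >= min_total
filtered_counts <- raw_counts[keep, , drop = FALSE]

print(rownames(filtered_counts))
```

Genes with all-zero counts, which is where most retired IDs ended up in the datasets mentioned above, drop out at this stage regardless of the exact cutoff.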

After the mapping step, very few genes with multiple mappings remain; I check these manually and decide case by case. In my mouse example dataset, the following genes were left at this stage.

      ensembl_gene_id    mgi_symbol
18134 ENSMUSG00000086915 Gm16364
18166 ENSMUSG00000087014 Gm16364
16916 ENSMUSG00000082803 Gm26460
19095 ENSMUSG00000092802 Gm26460
13405 ENSMUSG00000057626 Rpl10-ps6
16284 ENSMUSG00000080885 Rpl10-ps6
9693  ENSMUSG00000038729 Pakap
18577 ENSMUSG00000089945 Pakap
18619 ENSMUSG00000090053 Pakap
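A list like the one above can be pulled out of the mapping table with a few lines of base R. The sketch below is one possible way to do it, not necessarily how the list was produced. It uses a small hand-built `genemap` with the same two columns; the last row is a made-up placeholder, and the handling of missing symbols (empty string or NA) is an assumption about what the lookup returns.

```r
# Small stand-in for the data frame returned by the mapping step.
# The first five rows are taken from the list above; the last row is invented.
genemap <- data.frame(
  ensembl_gene_id = c("ENSMUSG00000086915", "ENSMUSG00000087014",
                      "ENSMUSG00000038729", "ENSMUSG00000089945",
                      "ENSMUSG00000090053", "ENSMUSG_PLACEHOLDER"),
  mgi_symbol = c("Gm16364", "Gm16364", "Pakap", "Pakap", "Pakap", "GeneX"),
  stringsAsFactors = FALSE
)

# Drop IDs without a symbol (assumed to appear as "" or NA)
genemap <- genemap[!is.na(genemap$mgi_symbol) & genemap$mgi_symbol != "", ]

# Count how many Ensembl IDs share each symbol
ids_per_symbol <- table(genemap$mgi_symbol)
multi_symbols <- names(ids_per_symbol)[ids_per_symbol > 1]

# Rows to inspect manually, grouped by symbol
multi_map <- genemap[genemap$mgi_symbol %in% multi_symbols, ]
multi_map <- multi_map[order(multi_map$mgi_symbol), ]
print(multi_map)
```

The result is only a list of candidates; deciding which ID to keep for each symbol is still a manual call.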

score 1 · Answer 2 · 2019-07-16

5.4 years ago

benformatics 4.0k

You do not have multiple counts for the same gene locus; that would not make any sense. The reality is that some genes have multiple "copies" across the genome (e.g. rRNA).

I suggest you stick with the gene IDs that are unique (i.e. ENSG####) and only choose one label for each of those genes.

If you use biomaRt with hg38 and extract the "ensembl_gene_id" and "external_gene_name" attributes, you will get only one gene name per Ensembl gene ID.

You should only think about merging counts if the genes you are interested in, or those that are drastically differentially expressed, are among the 2000+ gene symbols with multiple copies. In that case I would worry about it retroactively.
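If you do end up merging counts for a handful of such symbols after the fact, base R's `rowsum()` is one way to sum rows that share a label. This is a minimal sketch under stated assumptions: the count matrix, the placeholder IDs, and the `symbols` vector are invented for illustration, and in practice `symbols` would come from your own ID-to-name mapping.

```r
# Toy count matrix: two IDs share a symbol, one is unique.
# All IDs, symbols, and counts here are made up.
counts <- matrix(
  c(10, 12,
     3,  5,
    40, 38),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ENSG_A1", "ENSG_A2", "ENSG_B1"), c("ctrl", "treat"))
)

# One gene symbol per row of `counts`, in the same order
symbols <- c("GENE_A", "GENE_A", "GENE_B")

# Sum the rows that share a symbol; the result has one row per symbol
merged <- rowsum(counts, group = symbols)
print(merged)
```

This only makes sense on raw counts and only for the specific symbols you have checked; it does not recover reads that were discarded as multi-mappers during counting.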

ADD COMMENT • link 5.4 years ago by benformatics 4.0k

0

When some genes have multiple copies, there is also a multi-mapping issue. By default, featureCounts drops multi-mapping reads when assigning reads, so merging counts would not be accurate (it is more of a multi-mapping issue). Given this, would you recommend using EM-based methods to assign reads?

ADD REPLY • link 2.7 years ago by ccfpwll ▴ 10

1

featureCounts will count multi-mapping reads if you tell it to with the -M flag. If you had your aligner output one alignment per read, then the merging would still work. The conservative approach would be to drop the multi-mappers anyway. I'm sure you could try an EM-based method, but it complicates your whole analysis. Though if you are very interested in those regions, I don't see any strong argument against trying it.

ADD REPLY • link 2.7 years ago by benformatics 4.0k