I have a matrix with RNA-seq counts of reads. Ensemble gene IDs in rows, samples in columns. I want to check for differential expression between 2 groups of samples.
I retrieved gene symbols using biomaRt and noticed that for different ensembl IDs took back the same gene symbol. I was thinking to merge different Ensemble IDs entries and sum up their reads counts. Is that a consistent approach?
In addition, for some IDs (e.g ENSG00000069712) on Ensembl website (GRCh38.p12) appear to be retired while on Archive Ensemble (GRCh38.p7) I get an associated gene.
Merging different Ensemble IDs by summing up their read counts does not seem to be a consistent approach. A nice explanation of why multiple id mapping occur is explained here. I have faced this issue, of multiple Ensembl IDs mapping to the same gene name, multiple times. I am not aware of a general approach but here is the approach I use.
Remove low count genes in the raw data matrix before normalization (most of the retired Ensemble gene IDs had zero raw counts in my datasets)
Do normalization. In the normalized gene expression matrix, convert the Ensemble gene IDs to gene symbols using biomart (any remaining retired IDs get removed here).
After this step very few genes with multiple map remain which I check manually and decide. In my mouse example dataset the following genes were present at this stage.
Hi Ajay, I've also been analyizng RNA-seq data for mouse, and I've found that by directly parsing the genome annotation file, there's 100ish genes names that have multiple gene ids pointing to them.
Did you lose any genes after biomaRt conversion?
As you can see from the example above, most of the genes I lose are not the well-known or usually interesting genes. So it was not a problem for my work. Did you check what are those 100ish genes in your case?
You do not have multiple counts for the same gene loci. That would not make any sense. The reality is that there are genes with multiple "copies" across the genome (e.g. rRNA).
I suggest you stick with the gene IDs that are unique (i.e. ENSG####) and only choose one label for each of those genes.
If you use biomaRt and hg38 and extract the "ensembl_gene_id" and "external_gene_name" attributes. You will only get one gene name per Ensembl gene id.
You should only think about merging counts, if the genes that you are interested in or those that are drastically differentially expressed are among the 2000+ gene symbols with multiple copies. In this case I would worry about it retroactively.
In the case of some genes having multiple copies, multi-mapping issue would also exist. When assigning reads using featureCounts, the default would drop the multi-mapping reads. Therefore, merging reads wouldn't be accurate (it's more of a multi-mapping issue). Based on this, would you recommend using EM-based methods when assigning reads?
featureCounts will do this if you tell it to with the -M flag if you had your aligner output one alignment per read then the merging would still work. The conservative approach would be to drop the multi-mappers anyway. I'm sure you could try an EM-based method but it complicates your whole analysis. Though, if you are very interested in those regions I don't see any strong argument against trying it.
This seems exactly the follow-up question to "Why am I getting different ensembl gene ids for a given gene symbol?" that Emily_Ensembl suggested to start a new post. Even I came here with the same question: Is it okay to sum up the raw-counts/TPM of different Ensembl Ids of the same gene name?