Hi all,
I am sure this has been asked before, but I couldn't find the related posts. I just ran RNA-Seq alignment and quantification using the nf-core/rnaseq pipeline. I picked the star_salmon method, which aligns reads with STAR and quantifies with Salmon in alignment-based mode. I am using the GTF from GENCODE release 46. However, in the output there are multiple genes with the same name (e.g. CRLF2). They have different Ensembl IDs, and I think they are either copies on the different sex chromosomes or gene duplicates.
My question is: is there a proper way to handle them when doing differential gene expression (DGE) analysis? For now I have left them as they are and run the analysis anyway, but should I combine them? Is there a known list of genes with multiple Ensembl entries that could be used to merge them?
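To make the problem concrete, here is a minimal sketch of how gene names that map to more than one Ensembl ID could be listed from the GTF. It assumes GENCODE-style `gene_id` and `gene_name` attributes on `gene` records; the IDs, coordinates, and the `duplicated_gene_names` helper are made up for illustration and are not real annotation entries.

```python
import re
from collections import defaultdict

# Hypothetical GTF gene records (IDs and coordinates are invented for illustration).
GTF_LINES = [
    'chrX\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "ENSG_FAKE_0001.1"; gene_type "protein_coding"; gene_name "CRLF2";',
    'chrY\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "ENSG_FAKE_0002.1"; gene_type "protein_coding"; gene_name "CRLF2";',
    'chr1\tHAVANA\tgene\t500\t900\t.\t+\t.\tgene_id "ENSG_FAKE_0003.1"; gene_type "protein_coding"; gene_name "GENE_A";',
]

# Matches quoted attributes such as: gene_id "ENSG..."
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def duplicated_gene_names(gtf_lines):
    """Return {gene_name: [(gene_id, chrom), ...]} for names with more than one gene record."""
    by_name = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        attrs = dict(ATTR_RE.findall(fields[8]))
        by_name[attrs["gene_name"]].append((attrs["gene_id"], fields[0]))
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}


dups = duplicated_gene_names(GTF_LINES)
for name, ids in dups.items():
    print(name, ids)
```

On a real annotation the same function could be fed the lines of the GTF file instead of `GTF_LINES`; the chromosome column then shows whether the copies sit on chrX and chrY or elsewhere.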
Here are the related threads I found afterwards:
How to deal with the case that one gene symbol matches multiple ensembl ids?
Different Ensembl Ids point to the same gene symbol.
Why am I getting different ensembl gene ids for a given gene symbol?
Thanks a lot! The answers make sense. It seems there are people on both sides of whether to merge or not. For my purpose, and from a logical point of view, combining them makes more sense. For example, I can see genes from the pseudoautosomal regions (PAR) of the sex chromosomes splitting their expression between the copies, which does not make sense given that there are both males and females in the cohort. I am wondering whether it would be legitimate to combine them using tximport with a tx2gene table that puts all copies of a gene under the same name. Does Salmon adjust the counts for multimapped reads in a way that means I shouldn't do it this way?
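To illustrate what I mean by combining, here is a minimal sketch of the count-summing step only, assuming a tx2gene table keyed on gene name rather than Ensembl gene ID. The transcript names, counts, and the `summarize_counts` helper are hypothetical. As far as I understand, tximport itself also summarizes abundances and effective lengths, so this is not a replacement for it, and it does not answer whether merging is statistically appropriate.

```python
from collections import defaultdict

# Hypothetical per-transcript estimated counts for one sample
# (comparable to the NumReads column of a Salmon quant.sf file).
tx_counts = {"TX_X1": 30.0, "TX_X2": 10.5, "TX_Y1": 25.0, "TX_A1": 100.0}

# Hypothetical tx2gene table that maps transcripts to gene NAMES,
# so the chrX and chrY copies of CRLF2 end up under one label.
tx2gene_by_name = {
    "TX_X1": "CRLF2",
    "TX_X2": "CRLF2",
    "TX_Y1": "CRLF2",
    "TX_A1": "GENE_A",
}


def summarize_counts(tx_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts using the given mapping."""
    gene_counts = defaultdict(float)
    for tx, count in tx_counts.items():
        gene_counts[tx2gene[tx]] += count
    return dict(gene_counts)


gene_counts = summarize_counts(tx_counts, tx2gene_by_name)
print(gene_counts)
```

With this mapping all three CRLF2 transcripts contribute to a single CRLF2 row, whereas a tx2gene keyed on Ensembl gene ID would keep the two copies as separate rows.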