Dear all, I am processing my RNA-seq data and noticed that some gene names (symbols) map to multiple Ensembl gene IDs. Looking into the details, I found that the extra IDs come from alternate loci (haplotypes/patches). Can anyone provide the complete list of gene names/symbols that have multiple Ensembl gene IDs because of haplotypes/patches in the hg38 human genome build? It would also be very helpful to hear your views on how such genes should be treated in differential expression analysis by various programs. Should I use the Ensembl gene ID from the primary assembly, the ID (primary or haplotype) with the highest read count/expression, or the sum of read counts/expression across all Ensembl gene IDs for the same symbol? Which of these is the best way to deal with this situation?
See: A: How to deal with the case that one gene symbol matches multiple ensembl ids?
What reference dataset are you using? It might be that you're not using the correct one.
Best practice is to work with a reference where there is only one representative for each locus (unless you specifically want to look at isoforms for instance).
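To check how your own annotation behaves, something along these lines may help. It is only a minimal sketch, assuming an Ensembl-style GTF in which `gene` records carry `gene_id` and `gene_name` attributes and the primary chromosomes are named 1-22, X, Y and MT (optionally with a `chr` prefix). The function names and the `PRIMARY` set are my own, not part of any library. It lists the symbols that have more than one Ensembl gene ID and shows what is left when only primary-assembly loci are kept.

```python
import re
from collections import defaultdict

# Assumption: primary chromosomes are named 1-22, X, Y, MT (or M),
# optionally prefixed with "chr". Everything else (haplotypes, patches,
# unplaced scaffolds) is treated as non-primary.
PRIMARY = {str(i) for i in range(1, 23)} | {"X", "Y", "MT", "M"}


def is_primary(seqname):
    """Return True if the sequence name looks like a primary chromosome."""
    name = seqname[3:] if seqname.startswith("chr") else seqname
    return name in PRIMARY


def genes_by_symbol(gtf_lines):
    """Map gene symbol -> {gene_id: seqname}, using only 'gene' records."""
    symbols = defaultdict(dict)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = re.search(r'gene_id "([^"]+)"', fields[8])
        gene_name = re.search(r'gene_name "([^"]+)"', fields[8])
        if gene_id and gene_name:  # some genes have no symbol at all
            symbols[gene_name.group(1)][gene_id.group(1)] = fields[0]
    return symbols


def duplicated_symbols(symbols):
    """Keep only the symbols that map to more than one gene ID."""
    return {s: dict(ids) for s, ids in symbols.items() if len(ids) > 1}


def primary_only(symbols):
    """Drop gene IDs located on non-primary sequences."""
    kept = {}
    for symbol, ids in symbols.items():
        on_primary = {g: c for g, c in ids.items() if is_primary(c)}
        if on_primary:
            kept[symbol] = on_primary
    return kept
```

You would call it as `genes_by_symbol(open("annotation.gtf"))`, where the file name is a placeholder for your own GTF. If `duplicated_symbols(primary_only(...))` still returns entries, those duplicates are not caused by haplotypes/patches and need to be looked at separately.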
Thank you for your answer. I am using the hg38 Ensembl reference.
Hi Lieven, where can I get a reference for human build hg38 that has only one representative for each locus? Could you provide a link? Thank you.
What format do you want your reference in? Are you looking for a GTF (to use with a traditional aligner), a cDNA FASTA (to use with a pseudo-aligner), or just a list? Are you interested only in protein-coding genes or in all genes?
Thank you very much Sir.