Multiple Ensembl gene IDs for the same gene name (symbol): how to deal with this in differential analysis?
3.8 years ago

Dear all, while processing my RNA-seq data I observed multiple Ensembl gene IDs for the same gene name (symbol). I looked into the details and found that these come from alternate loci (haplotypes/patches). Can anyone provide the complete list of genes (gene names/symbols) that have multiple Ensembl gene IDs due to haplotypes/patches in the hg38 human genome build? It would also be very helpful to hear your views on how genes with multiple Ensembl gene IDs should be treated in differential analysis by the various programs. Which should I choose to handle this in the best possible way: the Ensembl gene ID from the primary assembly, the one from the haplotype showing the highest read counts/expression, or the sum of read counts/expression across all Ensembl gene IDs for the same gene (symbol)?

RNA-Seq alignment next-gen ensembl • 4.3k views

What reference dataset are you using? It might be that you're not using the correct one.

Best practice is to work with a reference where there is only one representative for each locus (unless you specifically want to look at isoforms for instance).


Thank you for your answer. I am using the hg38 Ensembl reference.


Hi Lieven, where can I get a reference with only one representative for each locus for human build hg38? Can you provide a link? Thank you.


What format do you want your reference in? Are you looking for a GTF (to use with a traditional aligner), a cDNA FASTA (to use with a pseudo-aligner), or just a list? Are you interested only in protein-coding genes, or in all genes?


Thank you very much Sir.

3.8 years ago

Which pipeline are you using? If you are mapping to the genome, the easiest thing to do is to use one of the "analysis ready" genomes as your reference.

On Ensembl this is labeled as the "primary_assembly". E.g. here: ftp://ftp.ensembl.org/pub/release-102/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz

On UCSC it is labeled as the "Analysis set" e.g. here: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/analysisSet/hg38.analysisSet.chroms.tar.gz
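If you have already quantified against the full annotation (with haplotypes/patches), a rough way to see which gene IDs sit outside the primary assembly is to split them by sequence name. This is only a sketch: it assumes an Ensembl-style GTF in which alternate sequences have names starting with `CHR_` (please check this against your own file), and the function name and the `alt_prefix` parameter are made up for illustration.

```python
import re

# Pulls the Ensembl gene ID out of the GTF attribute column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def split_gene_ids_by_region(gtf_lines, alt_prefix="CHR_"):
    """Return (primary_ids, alternate_ids) from an iterable of GTF lines.

    Assumption: sequences whose name starts with `alt_prefix` are
    haplotypes/patches; everything else is treated as primary assembly.
    """
    primary, alternate = set(), set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        match = GENE_ID.search(fields[8])
        if not match:
            continue
        if fields[0].startswith(alt_prefix):
            alternate.add(match.group(1))
        else:
            primary.add(match.group(1))
    return primary, alternate
```

You could feed it an open file handle (e.g. from `gzip.open(path, "rt")`) and then keep only the rows of your count table whose gene ID is in the first set before running the differential analysis.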


Likely you will still see a small number of cases where there are two Ensembl gene IDs for a given symbol. These are simply places where Ensembl thinks a particular locus is two genes, while HGNC thinks it is one.
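To get the list of affected symbols for whatever annotation you are using, something along these lines should work. It is a minimal sketch, assuming an Ensembl-style GTF whose gene records carry `gene_id` and `gene_name` attributes; the function name is made up for illustration.

```python
import re
from collections import defaultdict

# Attribute patterns as they appear in the ninth GTF column.
GENE_ID = re.compile(r'gene_id "([^"]+)"')
GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def symbols_with_multiple_ids(gtf_lines):
    """Map each symbol to its sorted gene IDs, keeping only symbols with more than one ID."""
    ids_by_symbol = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only look at gene-level records
        gene_id = GENE_ID.search(fields[8])
        gene_name = GENE_NAME.search(fields[8])
        if gene_id and gene_name:  # some genes have no symbol at all
            ids_by_symbol[gene_name.group(1)].add(gene_id.group(1))
    return {
        symbol: sorted(ids)
        for symbol, ids in ids_by_symbol.items()
        if len(ids) > 1
    }
```

Running it on the primary-assembly annotation versus the full one (with haplotypes/patches) lets you compare how many duplicated symbols remain in each.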


I've run into another cause of this in the mouse genome primary assembly: some genes have multiple IDs, one on a chromosome and another on an unlocalized sequence.
