How can I convert htseq-count output from Ensembl gene IDs (ENSG) with counts to HGNC gene symbols with counts?
If I use BioMart online or in R (see code below), I lose the Ensembl gene IDs that have no corresponding symbol or that collapse to a single symbol. I start with 57778 Ensembl IDs and get back 35699 gene symbols. The gene symbols are also returned in a different order and without their corresponding counts, which complicates further analysis. I would like to use the gene symbols and counts together for downstream pathway analysis following edgeR or DESeq2. Any guidance is appreciated.
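library(biomaRt)  # provides useMart() and getBM()
MLL <- read.delim("/Path.txt", header=FALSE)
colnames(MLL) <- c("ENSEMBL_GENE_ID", "Counts")
human <- useMart("ENSEMBL_MART_ENSEMBL", dataset="hsapiens_gene_ensembl")
# Only hgnc_symbol is requested, so the Ensembl IDs are not returned alongside the symbols
results <- getBM(attributes=c("hgnc_symbol"), filters="ensembl_gene_id", values=MLL$ENSEMBL_GENE_ID, mart=human)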
Below is a summary of the problem: the gene symbols are fewer in number than the Ensembl IDs, and I am not sure how to link the counts to the symbols.
MLL:
ENSEMBL_GENE_ID Counts
1 ENSG00000000003 4
2 ENSG00000000005 0
3 ENSG00000000419 586
4 ENSG00000000457 384
... row 57778
results:
hgnc_symbol
1 GENEA
2 GENEB
3 GENEC
... row 35699
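For illustration, here is a minimal sketch of the kind of linking I am after. It assumes the mapping table contains both the Ensembl ID and the symbol, for example by requesting attributes=c("ensembl_gene_id", "hgnc_symbol") from getBM. The mapping data frame below is a hypothetical stand-in built by hand so the snippet runs without a BioMart connection. The symbols are the placeholders from the example above, not real HGNC symbols, and the label column is a name made up for this sketch.

```r
# Toy stand-in for the htseq-count table (first four rows of MLL shown above)
MLL <- data.frame(
  ENSEMBL_GENE_ID = c("ENSG00000000003", "ENSG00000000005",
                      "ENSG00000000419", "ENSG00000000457"),
  Counts = c(4, 0, 586, 384),
  stringsAsFactors = FALSE
)

# Hypothetical mapping table, shaped like a getBM() result that requested
# both ensembl_gene_id and hgnc_symbol. Symbols are placeholders, and one ID
# is deliberately left out to mimic a gene with no HGNC symbol.
mapping <- data.frame(
  ensembl_gene_id = c("ENSG00000000003", "ENSG00000000419", "ENSG00000000457"),
  hgnc_symbol = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# Look up each ID in the mapping; match() returns NA for IDs that are absent
# and only the first hit when an ID maps to several symbols
idx <- match(MLL$ENSEMBL_GENE_ID, mapping$ensembl_gene_id)
MLL$hgnc_symbol <- mapping$hgnc_symbol[idx]

# BioMart can return empty strings for missing symbols; treat those as NA too
empty <- !is.na(MLL$hgnc_symbol) & MLL$hgnc_symbol == ""
MLL$hgnc_symbol[empty] <- NA

# Fall back to the Ensembl ID where no symbol exists, so no row is dropped
MLL$label <- ifelse(is.na(MLL$hgnc_symbol), MLL$ENSEMBL_GENE_ID, MLL$hgnc_symbol)

print(MLL)
```

Because the lookup is done against the original table, the row order and row count of MLL are unchanged and every count stays attached to its gene. Several Ensembl IDs that share one symbol would still need to be handled separately (for example by summing their counts), which this sketch does not attempt.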