Hello All,
I am new to the RNA-seq world and especially new to the bioinformatics side. We recently completed an RNA-seq experiment (total RNA) on human samples and used Illumina's DRAGEN RNA pipeline, which generated Salmon gene count (.sf) output files. In these files, the gene IDs are Ensembl gene IDs with version numbers, as follows:
Name Length EffectiveLength TPM NumReads
ENSG00000223972.4 1483 1290.1 0.065 1.63
ENSG00000227232.4 1612 1415.02 10.139 281.06
ENSG00000243485.2 462 314.72 0 0
ENSG00000237613.2 889 720.64 0 0
ENSG00000268020.2 483 347.96 0 0
ENSG00000240361.1 940 774.95 0 0
ENSG00000186092.4 918 752.97 0 0
ENSG00000238009.2 2079 1905.46 1.007 37.61
ENSG00000239945.1 1319 1147.49 0.224 5.03
.... (more than 50,000 ENSG IDs in total)
I'd like to convert these versioned ENSG IDs to ENSG stable IDs and gene symbols, and also get the gene description, gene type, and gene length, if possible.
So far, I've tried the Ensembl BioMart web interface. I was able to paste in all >50,000 ENSG IDs (as shown above), but the output contains only about 26,000 rows, and the order of those rows is different from my input. Do you know why this happens? I was expecting a CSV table with the input ID and the output in the same row, but the input IDs do not appear in the output file at all.
The output file looks like this:
Gene stable ID Gene name Transcript length (including UTRs and CDS) Gene description Gene type
ENSG00000019995 ZRANB1 5695 zinc finger RANBP2-type containing 1 [Source:HGNC Symbol;Acc:HGNC:18224] protein_coding
ENSG00000019995 ZRANB1 587 zinc finger RANBP2-type containing 1 [Source:HGNC Symbol;Acc:HGNC:18224] protein_coding
ENSG00000039139 DNAH5 15633 dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950] protein_coding
ENSG00000039139 DNAH5 760 dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950] protein_coding
ENSG00000039139 DNAH5 676 dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950] protein_coding
ENSG00000039139 DNAH5 2081 dynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950] protein_coding
ENSG00000053328 METTL24 1119 methyltransferase like 24 [Source:HGNC Symbol;Acc:HGNC:21566] protein_coding
ENSG00000053328 METTL24 777 methyltransferase like 24 [Source:HGNC Symbol;Acc:HGNC:21566] protein_coding
.... (there are a total of about 26000 rows)
I then tried to install "biomaRt" in R via Bioconductor (BiocManager), but I couldn't complete the installation due to some errors (this was on an Ubuntu computer). I switched to a Windows computer and was able to install biomaRt in R, but it reports that the connection to the server is not good (I saw on the Ensembl news page that they are migrating servers?).
So I installed "EnsDb.Hsapiens.v86" in R on the Windows computer. But with my current lack of knowledge, I couldn't figure out the code for converting the gene IDs (I have some knowledge of shell scripts and Python, and can understand code with annotations).
Could you point me to some resources, such as example code, for this kind of gene ID conversion using either biomaRt or EnsDb.Hsapiens.v86? (I did read the reference manual for EnsDb.Hsapiens.v86 but couldn't quickly figure out how to use a .fa file to input the query gene IDs.)
Thanks so much! & Sorry for the long post. Jian
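On the row-order question: the export above has one row per transcript (note the repeated DNAH5 and METTL24 rows), presumably because a transcript-level attribute was selected, and it does not echo the input IDs. One way around both is to do the join locally. Below is a minimal Python sketch, not a definitive recipe. It assumes both files are tab-separated, that the BioMart export uses the column headers shown above, and that the part before the dot in each Salmon ID is the stable ID. The function names and the second Salmon row are made up for illustration.

```python
import csv
import io


def strip_version(gene_id):
    """Drop a trailing numeric '.N' version suffix, if there is one."""
    base, dot, version = gene_id.rpartition(".")
    return base if dot and version.isdigit() else gene_id


def load_annotation(handle):
    """Map gene stable ID -> the first export row seen for that gene."""
    annotation = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        # The export repeats each gene once per transcript; keep one row per gene.
        annotation.setdefault(row["Gene stable ID"], row)
    return annotation


def annotate(sf_handle, annotation):
    """Yield one annotated row per Salmon row, in the original input order."""
    for row in csv.DictReader(sf_handle, delimiter="\t"):
        stable_id = strip_version(row["Name"])
        hit = annotation.get(stable_id, {})  # empty dict when the gene is not in the export
        yield {
            "input_id": row["Name"],
            "stable_id": stable_id,
            "gene_name": hit.get("Gene name", ""),
            "gene_type": hit.get("Gene type", ""),
            "description": hit.get("Gene description", ""),
            "NumReads": row["NumReads"],
        }


# In-memory stand-ins for the two files.
sf_text = "\n".join([
    "Name\tLength\tEffectiveLength\tTPM\tNumReads",
    "ENSG00000223972.4\t1483\t1290.1\t0.065\t1.63",
    "ENSG00000039139.1\t0\t0\t0\t0",  # toy row: version and values are made up
]) + "\n"

mart_text = "\n".join([
    "Gene stable ID\tGene name\tTranscript length (including UTRs and CDS)\tGene description\tGene type",
    "ENSG00000039139\tDNAH5\t15633\tdynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]\tprotein_coding",
    "ENSG00000039139\tDNAH5\t760\tdynein axonemal heavy chain 5 [Source:HGNC Symbol;Acc:HGNC:2950]\tprotein_coding",
]) + "\n"

annotation = load_annotation(io.StringIO(mart_text))
rows = list(annotate(io.StringIO(sf_text), annotation))
for r in rows:
    print(r["input_id"], r["stable_id"], r["gene_name"], r["gene_type"], sep="\t")
```

With real files, the two `io.StringIO(...)` objects would be replaced by `open("quant.genes.sf")` and `open("mart_export.txt")` (placeholder file names). Genes missing from the export keep an empty annotation instead of being dropped, so the output has as many rows as the input.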
Hi! I am having a similar issue; maybe someone could help me. My Salmon gene quant output file also uses Ensembl gene IDs with version numbers, but when I try to match them against the mmusculus_gene_ensembl dataset using biomaRt, I get no matching genes. I believe the issue is the version numbers in my Salmon output files, but I can't find a way to deal with them.
my salmon output file with gene ID format with version numbers
vs the mmusculus_gene_ensembl
I would really appreciate some help. thank you!
Remove the version numbers using the directions in this thread and it should work: "Mapping Ensembl Gene IDs with dot suffix"
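Since Python was mentioned earlier in the thread, here is a minimal sketch of the same idea outside R. The helper name `strip_version` is made up for illustration, and it assumes the version is whatever follows the last dot and is purely numeric; IDs without such a suffix are returned unchanged.

```python
def strip_version(gene_id: str) -> str:
    """Drop the trailing '.N' version suffix from an Ensembl ID, if present."""
    base, dot, version = gene_id.rpartition(".")
    # Only strip when a dot was found and the part after it is all digits.
    return base if dot and version.isdigit() else gene_id


ids = ["ENSG00000223972.4", "ENSG00000227232.4", "ENSG00000243485"]
stable_ids = [strip_version(i) for i in ids]
print(stable_ids)
```

The unversioned IDs can then be used as the "Gene stable ID" filter values when querying the Ensembl dataset.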
thank you so much!