If you know the Ensembl build that you used to get the Gene IDs, you can easily get the corresponding Gene Names from Biomart or even the GTF file of the same build. If you want to do it in R, use the biomaRt package.
Option 1: Using awk to get Gene ID and Name from GTF file:
The following awk command extracts each Gene ID and its associated Gene Name from the GTF file:
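awk '{
    # walk over every whitespace-separated field on the line
    for (i = 1; i <= NF; i++) {
        # if the field is one of the two attribute keys, print the value after it
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt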
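For anyone wondering what the one-liner does: awk splits each GTF line on whitespace, the loop walks over the fields, and whenever a field matches gene_id or gene_name it prints the field that follows (the quoted value). The sed step then strips the quotes and semicolons and turns the first space into a tab (note that \t in the replacement is understood by GNU sed, but not by every sed), and sort | uniq collapses the rows that repeat for every feature of the same gene. Below is a minimal sketch of the same idea on a toy GTF, with the cleanup done inside awk so it does not depend on the sed flavour. The function name gene_id_name_pairs and the TOY_ identifiers are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original answer): print unique
# "gene_id<TAB>gene_name" pairs from the GTF file given as $1.
gene_id_name_pairs() {
    awk '{
        id = ""; name = ""
        # look at every field except the last; the value follows its key
        for (i = 1; i < NF; i++) {
            if ($i == "gene_id")   id = $(i + 1)
            if ($i == "gene_name") name = $(i + 1)
        }
        # strip the surrounding quotes and the trailing semicolon
        gsub(/[";]/, "", id)
        gsub(/[";]/, "", name)
        # skip lines without a gene_id (for example comment lines)
        if (id != "") print id "\t" name
    }' "$1" | sort -u
}

# Toy input: two exons of one made-up gene and one exon of another.
toy_gtf=$(mktemp)
printf '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "TOY_G1"; transcript_id "TOY_T1"; exon_number "1"; gene_name "TOYA";\n' > "$toy_gtf"
printf '1\tprotein_coding\texon\t300\t400\t.\t+\t.\tgene_id "TOY_G1"; transcript_id "TOY_T1"; exon_number "2"; gene_name "TOYA";\n' >> "$toy_gtf"
printf '2\tlincRNA\texon\t500\t600\t.\t-\t.\tgene_id "TOY_G2"; transcript_id "TOY_T2"; exon_number "1"; gene_name "TOYB";\n' >> "$toy_gtf"

gene_id_name_pairs "$toy_gtf"
# prints two tab-separated rows:
# TOY_G1	TOYA
# TOY_G2	TOYB

rm -f "$toy_gtf"
```

On a real file you would call it as gene_id_name_pairs Homo_sapiens.GRCh37.70.gtf > Homo_sapiens.GRCh37.70.txt. Unlike the one-liner, this version leaves no trailing space after the gene name.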
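Read this file into R:
> # strip.white=TRUE drops the trailing space that the awk/sed step leaves after the gene name
> annot = read.delim('Homo_sapiens.GRCh37.70.txt', header=FALSE, strip.white=TRUE)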
> head(annot)
               V1       V2
1 ENSG00000000003   TSPAN6
2 ENSG00000000005     TNMD
3 ENSG00000000419     DPM1
4 ENSG00000000457    SCYL3
5 ENSG00000000460 C1orf112
6 ENSG00000000938      FGR
Merge your existing data with this annotation, assuming your data frame is named existingfile and the column containing the Ensembl Gene IDs is GeneID:
merged.file = merge(existingfile, annot, by.x='GeneID', by.y='V1')
Option 2: Using biomaRt in R:
library(biomaRt)
# Get an archived version of ensembl i.e. ensembl 70 in this case
ensembl = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "jan2013.archive.ensembl.org")
# example list of ENSG ids
ensemblID = c('ENSG00000242959','ENSG00000160396','ENSG00000229494')
# use this list to get corresponding Gene Symbols
results = getBM(attributes = c('hgnc_symbol','ensembl_gene_id'), filters = "ensembl_gene_id", values = ensemblID, mart = ensembl)
# If you have a csv file you can read it like this
ensemblID = read.csv('file.csv')
# get the GeneIDs in the csv file
ensemblID = ensemblID[,1]
# use biomaRt
results = getBM(attributes = c('hgnc_symbol','ensembl_gene_id'), filters = "ensembl_gene_id", values = ensemblID, mart = ensembl)
Comparison of the two methods:
Using Option 1, you get this for the three example Ensembl IDs:
ENSG00000242959 RP4-599G15.3
ENSG00000160396 HIPK4
ENSG00000229494 AC012494.1
Using Option 2:
hgnc_symbol  ensembl_gene_id
HIPK4        ENSG00000160396
             ENSG00000229494
             ENSG00000242959
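Option 2 returns an empty hgnc_symbol for two of the three IDs, whereas the GTF file gives a gene name for all of them. Therefore, I would recommend you use the first option.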
Hi Komal,
Could you please explain to me how to do that in R? I have my data in a .csv file.
Where did you get this data from? What is the Ensembl build?
I used the Ensembl annotation GTF, release 37.7. I am going to identify antisense, and my file includes gene ID, antisense count, sense count, and strand.
Here is the link for the GTF file:
ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens
I have updated my answer. From here on, I will leave things to you; you should really work on your R skills. Read the biomaRt manual, but first learn the R basics, because merging files is a very simple task.
Also, there are many questions on Biostars like this.
Search this site for "biomart"; there are lots of examples.
Hi Komal,
I used the first option, and it works very well; I got all the results that I need. I would like your help including the source column (lincRNA, antisense, protein_coding, ...), which is the second column in the Ensembl GTF file, using the awk command above.
I hope you are not talking about the gene_biotype field in the GTF (because that is different from the second column). However, if you want to include the second column, you can get it by modifying the above code like this:
The second column I mean is the one shown below:
Yes, the awk command above will do that. It will give the second column as well as the columns that you got before.
I got the results, but when I tried to read this file in R, it came out like this:
So how can I use the merge function to merge my file with this one? As you can see, V2 now includes both the gene ID and the gene name. How can we separate them into V1, V2 and V3 so that the merge function runs correctly?
Read in the file like this:
Now you will get three separate columns, V1, V2 and V3. Then you can merge like before.
I did that and it worked well, but when I used the merge function in R as shown below, it gave me more observations than I expected: after merging my file with the annot file it is supposed to give me 16442 observations, but it gave me 28453. Is there any option in the merge function to match only on the same gene ID?
Alright here you go,
So you won't get a 1-to-1 relationship if you include the gene source, unfortunately.
I couldn't see it.
So is there any other way to get a source that matches each gene ID 1-to-1? Or is there any way to remove the undesired sources from the list?
Honestly, you will have to ask your supervisor about that.
Thanks a lot for helping me.
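On the 1-to-1 question above: the extra rows after merge() appear when the same gene ID is listed more than once in the annotation, once for each distinct value in the second GTF column. Which source to keep is an analysis decision, but one possible way to get back to one row per gene is to collapse all sources of a gene into a single comma-separated field before merging. This is only a sketch, assuming a tab-separated file with the columns source, gene_id, gene_name in that order; the column order, the function name collapse_sources and the toy rows are assumptions, not taken from the thread.

```shell
#!/bin/sh
# Hypothetical helper: collapse a 3-column TSV (source, gene_id, gene_name)
# to one row per gene, printing gene_id, gene_name and a comma-separated
# list of the distinct sources seen for that gene.
collapse_sources() {
    sort -u "$1" | awk -F '\t' '
        {
            key = $2 "\t" $3
            if (key in src) {
                src[key] = src[key] "," $1      # append a further source
            } else {
                src[key] = $1                   # first source for this gene
                order[++n] = key                # remember first-seen order
            }
        }
        END {
            for (k = 1; k <= n; k++) print order[k] "\t" src[order[k]]
        }' | sort
}

# Toy input: TOY_G1 is listed under two sources (one row repeated).
toy_annot=$(mktemp)
printf 'protein_coding\tTOY_G1\tTOYA\n' > "$toy_annot"
printf 'processed_transcript\tTOY_G1\tTOYA\n' >> "$toy_annot"
printf 'protein_coding\tTOY_G1\tTOYA\n' >> "$toy_annot"
printf 'lincRNA\tTOY_G2\tTOYB\n' >> "$toy_annot"

collapse_sources "$toy_annot"
# prints two tab-separated rows:
# TOY_G1	TOYA	processed_transcript,protein_coding
# TOY_G2	TOYB	lincRNA

rm -f "$toy_annot"
```

The collapsed file has exactly one row per gene ID, so merging on the gene ID column should no longer multiply the rows of your own table.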
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1
Can anyone explain what the loop does and what the sed command does?