How do I get the gene annotation for the latest version (GRCh38)?
8.3 years ago
line1438 ▴ 40

I have a gene annotation for all chromosomes and their SNPs for version GRCh37.68, which looks like this:


34554 36081 ENSG00000237613 FAM138A
69091 70008 ENSG00000186092 OR4F5
367640 368634 ENSG00000235249 OR4F29
621059 622053 ENSG00000185097 OR4F16
721320 722513 ENSG00000197049 AL669831.1
860260 879955 ENSG00000187634 SAMD11
879584 894670 ENSG00000188976 NOC2L
895967 901095 ENSG00000187961 KLHL17


The file is used to record all SNPs on a chromosome together with the gene each SNP belongs to.

I want to get the gene annotation for chromosomes 1 to 23 for the latest version (GRCh38), in the same format as shown above.

What should I do?

Thanks a lot!

gene GWAS • 5.4k views

It can be obtained from Ensembl BioMart.

8.3 years ago
EagleEye 7.6k

Unzip using:

gunzip -d Homo_sapiens.GRCh38.85.gtf.gz

Convert into table format:

awk 'BEGIN{FS="\t"} $3=="gene" {split($9,a,";"); print a[1]"\t"a[3]"\t"$1":"$4"-"$5"\t"$7}' Homo_sapiens.GRCh38.85.gtf | sed -e 's/gene_id "//' -e 's/ *gene_name "//' -e 's/"//g' > Homo_sapiens.GRCh38.85_table.txt

The above command converts the GTF into an annotation table like this:

ENSG00000223972  DDX11L1    1:11869-14409   +
ENSG00000227232  WASH7P 1:14404-29570   -
ENSG00000278267  MIR6859-1  1:17369-17436   -
ENSG00000243485  MIR1302-2  1:29554-31109   +
ENSG00000237613  FAM138A    1:34554-36081   -
ENSG00000268020  OR4G4P 1:52473-53312   +
ENSG00000240361  OR4G11P    1:62948-63887   +
ENSG00000186092  OR4F5  1:69091-70008   +
ENSG00000238009  RP11-34P13.7   1:89295-133723  -
ENSG00000239945  RP11-34P13.8   1:89551-91105   -
ENSG00000233750  CICP27 1:131025-134836 +
ENSG00000268903  RP11-34P13.15  1:135141-135895 -
ENSG00000269981  RP11-34P13.16  1:137682-137965 -
ENSG00000239906  RP11-34P13.14  1:139790-140339 -
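
If you need exactly the layout from the question (start, end, Ensembl ID, gene name), a variant along these lines might work. It is only a sketch: it looks up gene_id and gene_name by key instead of by position in column 9, keeps chromosomes 1-22 and X (assuming "23" in the question means X), and adds the chromosome as a first column. The function name gtf_to_gene_table and the file sample.gtf are made up for illustration; the sample lines reuse coordinates from the question, and one of them has its attributes reordered on purpose.

```shell
# Build a tiny illustrative GTF (tab-separated, 9 columns).
# The real input would be Homo_sapiens.GRCh38.85.gtf.
printf '1\thavana\tgene\t34554\t36081\t.\t-\t.\tgene_id "ENSG00000237613"; gene_name "FAM138A"; gene_source "havana";\n' > sample.gtf
printf '1\thavana\ttranscript\t34554\t36081\t.\t-\t.\tgene_id "ENSG00000237613"; gene_name "FAM138A";\n' >> sample.gtf
# Attributes deliberately reordered to show that the lookup is by key.
printf '1\thavana\tgene\t69091\t70008\t.\t+\t.\tgene_name "OR4F5"; gene_id "ENSG00000186092";\n' >> sample.gtf
# Hypothetical gene on a scaffold; it should be filtered out.
printf 'KI270728.1\tensembl\tgene\t100\t200\t.\t+\t.\tgene_id "ENSG_SCAFFOLD_EXAMPLE";\n' >> sample.gtf

gtf_to_gene_table() {
  # $1: path to an uncompressed GTF file
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
    # keep gene records on chromosomes 1-22 and X only
    $3 == "gene" && $1 ~ /^([1-9]|1[0-9]|2[0-2]|X)$/ {
      id = ""; name = ""
      n = split($9, attr, ";")
      for (i = 1; i <= n; i++) {
        s = attr[i]
        sub(/^ +/, "", s)                 # drop the space after each ";"
        if (s ~ /^gene_id "/) {
          sub(/^gene_id "/, "", s); sub(/"$/, "", s); id = s
        } else if (s ~ /^gene_name "/) {
          sub(/^gene_name "/, "", s); sub(/"$/, "", s); name = s
        }
      }
      if (name == "") name = id           # fall back to the ID if no name is given
      print $1, $4, $5, id, name
    }' "$1"
}

gtf_to_gene_table sample.gtf
# prints (tab-separated):
# 1  34554  36081  ENSG00000237613  FAM138A
# 1  69091  70008  ENSG00000186092  OR4F5
```

On the real file the call would be something like gtf_to_gene_table Homo_sapiens.GRCh38.85.gtf > genes_GRCh38.txt. Note that an attribute value that itself contains a ";" would break the simple split.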

Thank you very much!

Your answer is the best one for me.

Thanks a lot.


Good luck :)


I tried to find the same site myself on ensembl.org, but I can't seem to find it.

Could you tell me what I am doing wrong?

I want to find the site you mentioned:

If you are just looking for ensembl gene annotation,

ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/

Here are the steps I took:

  1. Go to http://asia.ensembl.org/index.html

  2. Human GRCh38.p7 http://asia.ensembl.org/Homo_sapiens/Info/Index

Then I tried to find the download page, but I still could not find the site you mentioned.


Go to the FTP site.


Thank you!

I now know how to get to the site and download the file.


GRCh38 is the latest genome version right now.

If GRCh39 comes out in the future, can I get its gene annotation from the same FTP site? (ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/)


Yes, this FTP link is updated to the current genome version whenever a new one is released. If GRCh39 is released in the future, the link will point to the new assembly.

Just FYI: keep in mind that the transcript/gene annotation version keeps being updated. Here you got annotation version 85 for GRCh38 (GRCh38.85); the previous versions for this assembly are GRCh38.76-84. For example, the GRCh37 assembly had 19 different Ensembl transcript/gene annotation versions (GRCh37.57-75).
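
Because the current_gtf link moves with each release, it may be worth pinning a specific release when you need reproducible coordinates. The sketch below only builds the URL; the helper name ensembl_gtf_url is made up, and the release-NN/gtf/homo_sapiens/ path layout is my assumption based on how the release 85 file is named, so check it against the FTP listing before relying on it.

```shell
ensembl_gtf_url() {
  # $1: Ensembl release number (e.g. 85), $2: assembly name (e.g. GRCh38)
  # Assumed layout: pub/release-<N>/gtf/homo_sapiens/Homo_sapiens.<assembly>.<N>.gtf.gz
  printf 'ftp://ftp.ensembl.org/pub/release-%s/gtf/homo_sapiens/Homo_sapiens.%s.%s.gtf.gz\n' "$1" "$2" "$1"
}

url=$(ensembl_gtf_url 85 GRCh38)
echo "$url"
# To actually download the file (needs network access):
# curl -O "$url"
```

Keeping the release number in the output file name as well makes it easier to tell later which annotation version a table came from.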

If you check this link you will get an idea of the version history. You can also use the GTF annotation from the following GENCODE link, but its gene IDs look slightly different: each ENSG ID carries a version suffix at the end, so ENSGXXXX is written as, for example, ENSGXXXX.2.

http://www.gencodegenes.org/releases/
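
If you need the GENCODE IDs to match plain Ensembl IDs, one option is to strip the version suffix afterwards. This is a minimal sketch that assumes a tab-separated table with the gene ID in the first column; the function name strip_ensg_version is made up, and the ID in the example is only there to show the shape of the input.

```shell
strip_ensg_version() {
  # Reads a tab-separated table on stdin and removes a trailing
  # ".<digits>" from the first column; other columns are left alone.
  awk 'BEGIN { FS = "\t"; OFS = "\t" } { sub(/\.[0-9]+$/, "", $1); print }'
}

printf 'ENSG00000223972.5\tDDX11L1\n' | strip_ensg_version
# prints: ENSG00000223972<TAB>DDX11L1
```

Lines whose first column has no such suffix (a header line, for instance) pass through unchanged.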

If you want to convert a GENCODE GTF to a simple table format, download:

ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_25/gencode.v25.annotation.gtf.gz

Unzip it as described in my earlier post, then run:

cat gencode.v25.annotation.gtf | awk 'BEGIN{FS="\t"}{split($9,a,";"); if($3~"gene") print a[1]"\t"a[5]"\t"$1":"$4"-"$5"\t"a[3]"\t"$7}' |sed 's/gene_id "//' | sed 's/gene_id "//' | sed 's/gene_type "//'| sed 's/gene_name "//' | sed 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2];}' > gencode.v25.annotation_annotation.txt

Small correction: the gene name and gene type are read from a[4] and a[2] rather than a[5] and a[3], the gzipped file is read directly, and a header line is written to the output file. The length is End - Start + 1 because GTF coordinates are 1-based and inclusive.

{ printf 'Geneid\tGeneSymbol\tChromosome\tStart\tEnd\tClass\tStrand\tLength\n'; zcat gencode.v25.annotation.gtf.gz | awk 'BEGIN{FS="\t"} $3=="gene" {split($9,a,";"); print a[1]"\t"a[4]"\t"$1":"$4"-"$5"\t"a[2]"\t"$7}' | sed -e 's/gene_id "//' -e 's/ *gene_type "//' -e 's/ *gene_name "//' -e 's/"//g' | awk 'BEGIN{FS="\t"}{split($3,a,"[:-]"); print $1"\t"$2"\t"a[1]"\t"a[2]"\t"a[3]"\t"$4"\t"$5"\t"a[3]-a[2]+1}'; } > gencode.v25.annotation_annotation.txt

Thank you so much. :)

8.3 years ago
EagleEye 7.6k

If you are just looking for the Ensembl gene annotation,

ftp://ftp.ensembl.org/pub/current_gtf/homo_sapiens/

For more downloads,

http://www.ensembl.org/info/data/ftp/index.html

FTP:

ftp://ftp.ensembl.org/pub/


Excuse me, I don't know how to get the gene annotation file I want from the site you mentioned.

Could you explain in detail how to obtain the gene annotation for chromosomes 1-23?

The format I want is:

34554 36081 ENSG00000237613 FAM138A
69091 70008 ENSG00000186092 OR4F5
367640 368634 ENSG00000235249 OR4F29
621059 622053 ENSG00000185097 OR4F16
721320 722513 ENSG00000197049 AL669831.1

Thanks a lot.


I found a file named "Homo_sapiens.GRCh38.85.gtf.gz". Is this the gene annotation you mean?

I also have a question: how do I convert this GTF file to a text file, or to some other format in which the positions of all genes are easy to see?
