How do I query the UCSC Table Browser (https://genome.ucsc.edu/cgi-bin/hgTables) to get a GTF file that includes gene biotypes? The biotypes can come from Ensembl; I just want an annotation file with biotypes, if possible. (:
Since you have tagged the post with 'refseq', I assume you are interested in the RefSeq annotation. If so, I suggest you download the relevant files directly from the NCBI FTP site. The GTF and GFF3 files for the RefSeq annotation include gene_biotype information in column 9.
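To check whether a given GTF actually carries that attribute, a minimal Python sketch along these lines may help. It assumes the usual GTF layout (nine tab-separated columns, with key "value"; pairs in column 9). The function names and the example record are made up for illustration and are not taken from any real file.

```python
import re

# Matches quoted GTF attribute pairs such as: gene_biotype "protein_coding";
# Unquoted values (some files write numbers without quotes) are ignored here.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_attributes(column9):
    """Return the quoted key/value pairs of a GTF attribute column as a dict."""
    # If a key is repeated (e.g. several db_xref entries), the last one wins.
    return {key: value for key, value in ATTR_RE.findall(column9)}


def gene_biotype(gtf_line):
    """Return the gene_biotype of one GTF record, or None if it has none."""
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        return None  # comment line or malformed record
    return parse_gtf_attributes(fields[8]).get("gene_biotype")


# Hypothetical record, for illustration only.
example = "\t".join([
    "chrZ", "ExampleSource", "gene", "100", "900", ".", "+", ".",
    'gene_id "GENE1"; gene_biotype "protein_coding";',
])
print(gene_biotype(example))
```

If this returns None for every record in your file, the file does not include gene_biotype and you probably have a GTF from a different source.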
Just adding for other users who land on this page.
Another solution is to simply generate a 'master' table in biomaRt:
require('biomaRt')
mart <- useMart('ENSEMBL_MART_ENSEMBL')
mart <- useDataset('hsapiens_gene_ensembl', mart)
Check that it is indeed GRCh38:
searchDatasets(mart = mart, pattern = 'hsapiens')
                 dataset              description    version
78 hsapiens_gene_ensembl Human genes (GRCh38.p13) GRCh38.p13
Now generate the table:
annotLookup <- getBM(
  mart = mart,
  attributes = c(
    'hgnc_symbol',
    'ensembl_gene_id',
    'refseq_mrna',
    'refseq_ncrna',
    'gene_biotype'),
  uniqueRows = TRUE)
head(annotLookup)
  hgnc_symbol ensembl_gene_id refseq_mrna refseq_ncrna   gene_biotype
1       MT-TF ENSG00000210049                                 Mt_tRNA
2     MT-RNR1 ENSG00000211459                NR_137294        Mt_rRNA
3       MT-TV ENSG00000210077                                 Mt_tRNA
4     MT-RNR2 ENSG00000210082                NR_137295        Mt_rRNA
5      MT-TL1 ENSG00000209082                                 Mt_tRNA
6      MT-ND1 ENSG00000198888                          protein_coding
Kevin
Hi vkkodali,
I have the same question. I followed your answer and downloaded the hg19 GTF file, but column 9 contains only the gene ID, not gene_biotype (screen capture: https://drive.google.com/file/d/1OkpcDF_u2-yzAKlg8s46vIVOi8pc4AZ1/view?usp=sharing ).
Please download the GTF from GENCODE: https://www.gencodegenes.org/
Please download the data from the NCBI RefSeq FTP site, not UCSC. For hg19, you can search for GRCh37 in the NCBI Assembly portal to get to this page. Once you are there, click the 'Download Assembly' button, then choose 'RefSeq' as the source database and GTF as the file type. You will end up downloading a tarball containing the GTF file. Alternatively, you can go to the FTP path directly by clicking the 'FTP directory for RefSeq assembly' link in the right-hand bar and choosing the file of interest.
^^ this can work, too.
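Whichever source you download from, it is worth confirming that the file really has biotypes before building on it. Here is a rough Python sketch that tallies biotype values over the 'gene' records of a plain or gzipped GTF. It assumes the file contains 'gene' records at all, and it looks for both gene_biotype and gene_type, since GENCODE files use the latter name. The function name and the file name in the usage comment are hypothetical.

```python
import gzip
import re
from collections import Counter


def count_biotypes(gtf_path, keys=("gene_biotype", "gene_type")):
    """Tally biotype values over the 'gene' records of a GTF file."""
    # One pattern per candidate attribute name; the prefix keeps a key from
    # matching inside a longer attribute name.
    patterns = [
        re.compile(r'(?:^|;\s*)' + re.escape(key) + r'\s+"([^"]*)"')
        for key in keys
    ]
    opener = gzip.open if str(gtf_path).endswith(".gz") else open
    counts = Counter()
    with opener(gtf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # header or comment line
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9 or fields[2] != "gene":
                continue
            for pattern in patterns:
                match = pattern.search(fields[8])
                if match:
                    counts[match.group(1)] += 1
                    break
            else:
                # Gene record without any of the requested attributes.
                counts["<no biotype attribute>"] += 1
    return counts


# Usage (hypothetical file name):
# for biotype, n in count_biotypes("annotation.gtf.gz").most_common():
#     print(biotype, n)
```

An empty result means the file has no 'gene' records, and a result made up only of "<no biotype attribute>" means the gene records carry no biotype. Either way, the file will not give you what the original question asks for.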