Where Can I Get Annovar 'Refgene' Format?
12.7 years ago
jessada ▴ 150

I used the ANNOVAR command line

annotate_variation.pl -downdb -buildver hg19 refGene humandb

to download hg19_refGene.txt from UCSC, and I will use this database to create the input file for ANNOVAR in the generic format described at http://www.openbioinformatics.org/annovar/annovar_filter.html#generic. However, the only description of the refGene format I can find is the one from http://genome.ucsc.edu/FAQ/FAQformat:

(
string  geneName;           "Name of gene as it appears in Genome Browser."
string  name;               "Name of gene"
string  chrom;              "Chromosome name"
char[1] strand;             "+ or - for strand"
uint    txStart;            "Transcription start position"
uint    txEnd;              "Transcription end position"
uint    cdsStart;           "Coding region start"
uint    cdsEnd;             "Coding region end"
uint    exonCount;          "Number of exons"
uint[exonCount] exonStarts; "Exon start positions"
uint[exonCount] exonEnds;   "Exon end positions"
)

This is not sufficient because the downloaded refGene file has more columns. For example:

1475    NM_000039    chr11    -    116706468    116708338    116706523    116708103    4    116706468,116707716,116708060,116708320,    116707127,116707873,116708123,116708338,    0    APOA1    cmpl    cmpl    2,1,0,-1,

I have looked in many places for the meaning of the last 6 columns. Can anyone point me to a site that explains what those columns mean?

annovar ucsc • 13k views

The format description has been updated: http://genome.ucsc.edu/FAQ/FAQformat#format9

But it is still wrong: before name there is a non-unique column called bin, and the field listed as uint id is actually the score.

It looks like there was no primary key for the data at that time.

12.7 years ago

As far as I can see from curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.sql", the columns are:

  `bin` smallint(5) unsigned NOT NULL,
  `name` varchar(255) NOT NULL,
  `chrom` varchar(255) NOT NULL,
  `strand` char(1) NOT NULL,
  `txStart` int(10) unsigned NOT NULL,
  `txEnd` int(10) unsigned NOT NULL,
  `cdsStart` int(10) unsigned NOT NULL,
  `cdsEnd` int(10) unsigned NOT NULL,
  `exonCount` int(10) unsigned NOT NULL,
  `exonStarts` longblob NOT NULL,
  `exonEnds` longblob NOT NULL,
  `score` int(11) default NULL,
  `name2` varchar(255) NOT NULL,
  `cdsStartStat` enum('none','unk','incmpl','cmpl') NOT NULL,
  `cdsEndStat` enum('none','unk','incmpl','cmpl') NOT NULL,
  `exonFrames` longblob NOT NULL,
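
As a rough illustration of how the schema above maps onto a row of the file, here is a minimal Python sketch that labels the example row from the question with these column names. It assumes the file is tab-delimited with one value per schema column, in schema order; the names `REFGENE_COLUMNS` and `parse_refgene_line` are made up for this example and are not part of ANNOVAR or UCSC tooling.

```python
# Column names taken from the refGene.sql schema, in order.
REFGENE_COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd",
    "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds",
    "score", "name2", "cdsStartStat", "cdsEndStat", "exonFrames",
]

# Columns that hold a single unsigned integer (score is left as text
# because the schema allows it to be NULL).
INT_FIELDS = ("bin", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount")

# Columns that hold a comma-separated list with a trailing comma.
LIST_FIELDS = ("exonStarts", "exonEnds", "exonFrames")


def parse_refgene_line(line):
    """Turn one tab-delimited refGene row into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(REFGENE_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(REFGENE_COLUMNS), len(values))
        )
    record = dict(zip(REFGENE_COLUMNS, values))
    for field in INT_FIELDS:
        record[field] = int(record[field])
    for field in LIST_FIELDS:
        # Skip the empty string left behind by the trailing comma.
        record[field] = [int(x) for x in record[field].split(",") if x]
    return record


# The example row from the question, rebuilt with tab separators.
example = "\t".join([
    "1475", "NM_000039", "chr11", "-",
    "116706468", "116708338", "116706523", "116708103", "4",
    "116706468,116707716,116708060,116708320,",
    "116707127,116707873,116708123,116708338,",
    "0", "APOA1", "cmpl", "cmpl", "2,1,0,-1,",
])
record = parse_refgene_line(example)
for column in REFGENE_COLUMNS:
    print(column, record[column], sep="\t")
```

Read this way, the trailing columns of the example row are score, name2 (the gene symbol), cdsStartStat, cdsEndStat and exonFrames, and the leading 1475 is bin.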

All the fields you need are present in http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz

You just need to remove some columns (bin...)
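
A minimal Python sketch of removing the leading bin column, assuming the file is tab-delimited and bin is the first column as in the schema. The helper name `drop_leading_columns` is hypothetical, and whether any other columns need to go depends on the target format, so treat this as a starting point only.

```python
def drop_leading_columns(lines, n_drop=1):
    """Yield each tab-delimited line with its first n_drop columns removed."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[n_drop:])


# Small in-memory demonstration using the start of the example row.
demo = ["1475\tNM_000039\tchr11\t-\t116706468\t116708338\n"]
stripped = list(drop_leading_columns(demo))
print(stripped[0])

# For the real download, something along these lines should work
# (the output file name is an arbitrary choice for this example):
#
#   import gzip
#   with gzip.open("refGene.txt.gz", "rt") as src, \
#           open("refGene.nobin.txt", "w") as dst:
#       for row in drop_leading_columns(src):
#           dst.write(row + "\n")
```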


Hi Pierre,

How can I create a refGene file for a virus using ANNOVAR?

9.7 years ago
gresserT ▴ 50

The complete and correct description is under "describe table schema":

http://genome.ucsc.edu/cgi-bin/hgTables?hgta_track=refGene

6.1 years ago
lffu_0032 ▴ 90

You can visit http://genome.ucsc.edu/cgi-bin/hgTables and click the "describe the schema" button; you will then see the RefSeq gene predictions format.
