I used the ANNOVAR command line
annotate_variation.pl -downdb -buildver hg19 refGene humandb
to download hg19_refGene.txt from UCSC, and I'll use this database to create the input file for ANNOVAR in the generic format (http://www.openbioinformatics.org/annovar/annovar_filter.html#generic). However, the only description of the refGene format I can find is the one at http://genome.ucsc.edu/FAQ/FAQformat:
(
string geneName; "Name of gene as it appears in Genome Browser."
string name; "Name of gene"
string chrom; "Chromosome name"
char[1] strand; "+ or - for strand"
uint txStart; "Transcription start position"
uint txEnd; "Transcription end position"
uint cdsStart; "Coding region start"
uint cdsEnd; "Coding region end"
uint exonCount; "Number of exons"
uint[exonCount] exonStarts; "Exon start positions"
uint[exonCount] exonEnds; "Exon end positions"
)
This is not sufficient because the downloaded refGene file has more columns. For example:
1475 NM_000039 chr11 - 116706468 116708338 116706523 116708103 4 116706468,116707716,116708060,116708320, 116707127,116707873,116708123,116708338, 0 APOA1 cmpl cmpl 2,1,0,-1,
I have looked in many places for the meaning of the last 6 columns. Can anyone point me to a site that explains what those columns mean?
The format description has been updated: http://genome.ucsc.edu/FAQ/FAQformat#format9
But it is still wrong: before name there is an extra, non-unique column called bin, and the field listed as uint id is actually the score.
It looks like there was no primary key for the data then.
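To make the column mapping concrete, here is a minimal Python sketch that labels the 16 fields of the example row above. The column names are an assumption: they follow the updated UCSC format description, with bin added in front and score in place of id as described in this answer. The function name parse_refgene_line is only illustrative. The real file is tab-separated; splitting on any whitespace is used here so that the space-separated row pasted in the question also parses.

```python
# Assumed column order for hg19_refGene.txt: the updated UCSC description,
# plus a leading "bin" column and "score" where the FAQ lists "id".
REFGENE_COLUMNS = [
    "bin", "name", "chrom", "strand", "txStart", "txEnd",
    "cdsStart", "cdsEnd", "exonCount", "exonStarts", "exonEnds",
    "score", "name2", "cdsStartStat", "cdsEndStat", "exonFrames",
]

INT_FIELDS = ("bin", "txStart", "txEnd", "cdsStart", "cdsEnd", "exonCount", "score")
LIST_FIELDS = ("exonStarts", "exonEnds", "exonFrames")


def parse_refgene_line(line):
    """Turn one refGene row into a dict keyed by column name."""
    fields = line.split()  # tabs in the real file, spaces in the pasted row
    if len(fields) != len(REFGENE_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(REFGENE_COLUMNS), len(fields))
        )
    record = dict(zip(REFGENE_COLUMNS, fields))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    for key in LIST_FIELDS:
        # These fields are comma-separated lists with a trailing comma.
        record[key] = [int(x) for x in record[key].rstrip(",").split(",")]
    return record


example = (
    "1475 NM_000039 chr11 - 116706468 116708338 116706523 116708103 4 "
    "116706468,116707716,116708060,116708320, "
    "116707127,116707873,116708123,116708338, "
    "0 APOA1 cmpl cmpl 2,1,0,-1,"
)
rec = parse_refgene_line(example)
for column in REFGENE_COLUMNS:
    print(column, rec[column], sep="\t")
```

With this mapping, the last five fields of the example row read as score, name2 (the gene symbol, APOA1), cdsStartStat, cdsEndStat and exonFrames, and the leading 1475 is the bin.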