Question

How to add gene annotation to a UCSC assembly hub?

10.3 years ago

Ian 6.1k

I am making my first UCSC assembly hub to display a non-UCSC annotated genome within the browser. All is well except that I cannot work out how to add the gene annotation, which is currently in GFF3 format. I am aware that track hubs only accept the "big" file versions, so presumably a bigBed version of the annotation is required. Does anyone know of a handy method of converting GFF3 to BED/bigBed? I think BED12 is required to retain the distinction between CDS, UTR and introns...

Thank you!

P.S. I have Googled this! The thread "Convert .Gff3 File To 12-Column .Bed File" is a help, but I would be interested to know if there have been developments since then.

EDIT: GTF or GFF2 can be used for gene annotation!

assembly hub gtf gff3 UCSC • 4.0k views

updated 3.1 years ago by Ram 45k • written 10.3 years ago by Ian 6.1k

Ram · Accepted Answer · 2015-03-30

In the end I contacted the UCSC Genome Browser team directly. I got a helpful and detailed reply, which I have edited to make it clearer how the necessary programs can be obtained. The commands below were run on 64-bit Linux. IMPORTANT NOTE: my question specified GFF3 as the starting format for the annotation, but it turned out to be much easier to start from GTF / GFF2.

Fetch the programs

wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/faToTwoBit
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/twoBitInfo
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/gtfToGenePred
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/extractGtf.pl
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredToBed
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/genePredCheck
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/ixIxx
wget http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
chmod +x gtfToGenePred genePredToBed genePredCheck bedToBigBed faToTwoBit twoBitInfo ixIxx

Download the Perl scripts from the UCSC Git repository

http://genome-source.cse.ucsc.edu/gitweb/?p=kent.git;a=tree;f=src/hg/utils/automation

extractGtf.pl
ensemblInfo.pl

Method

# Create twoBit version of genome
faToTwoBit genome.fa genome.2bit

# Get chromosome lengths from the twoBit genome, largest first
twoBitInfo genome.2bit stdout | sort -k2rn > genome.chrom.sizes
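
Since bedToBigBed reads this file later, it can be worth confirming that every line is a sequence name, a tab, and a positive integer. The following is only a sketch: example.chrom.sizes and its contents are made up to stand in for genome.chrom.sizes.

```shell
# Hypothetical two-sequence file standing in for genome.chrom.sizes
printf 'chr1\t248956422\nchrM\t16569\n' > example.chrom.sizes

# Count lines that are not exactly "name<TAB>positive integer"
bad_lines=$(awk -F'\t' 'NF != 2 || $2 !~ /^[0-9]+$/ || $2 <= 0 { n++ } END { print n + 0 }' example.chrom.sizes)
echo "malformed lines: $bad_lines"
```

To check the real file, point the awk command at genome.chrom.sizes instead.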

# Convert GTF annotation to genePred format
gtfToGenePred -infoOut=infoOut.txt -genePredExt genome.gtf genome.gp

# Check the genePred output is valid
genePredCheck genome.gp

# Convert genePred format to BED format
genePredToBed genome.gp stdout | sort -k1,1 -k2,2n > genome.bed
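
Before building the bigBed, a quick check that the BED file really has 12 columns and is sorted by chromosome and start can save a failed run. This is a rough sketch on a made-up record; example.bed is a hypothetical stand-in for genome.bed, and the use of LC_ALL=C to force byte-order sorting is my assumption about what is safest here.

```shell
# Hypothetical single-transcript BED12 record standing in for genome.bed
printf 'chr1\t100\t500\tTX1\t0\t+\t150\t450\t0\t2\t100,150\t0,250\n' > example.bed

# Count records that do not have exactly 12 tab-separated fields
bad_records=$(awk -F'\t' 'NF != 12 { n++ } END { print n + 0 }' example.bed)
echo "records without 12 fields: $bad_records"

# sort -c exits non-zero if the file is not already in the requested order
if LC_ALL=C sort -c -k1,1 -k2,2n example.bed; then
    sorted=yes
else
    sorted=no
fi
echo "sorted by chrom and start: $sorted"
```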

# Convert BED to bigBed
# extraIndex required for position/search
bedToBigBed -type=bed12 -extraIndex=name genome.bed genome.chrom.sizes genome.bb

# Required for indexing step
grep -v "^#" infoOut.txt | awk '{printf "%s\t%s,%s,%s,%s,%s\n", $1,$2,$3,$8,$9,$10}' > genome.nameIndex.txt
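
To see what the command above writes, here it is applied to a tiny made-up file. The column order in the header (transcript ID, gene ID, source, chrom, start, end, strand, protein ID, gene name, transcript name) is my assumption about the infoOut layout, and every value is hypothetical. Each output line is the transcript ID, a tab, then fields 2, 3, 8, 9 and 10 joined by commas.

```shell
# Hypothetical infoOut.txt: a header line plus one transcript; all values are made up
printf '#transId\tgeneId\tsource\tchrom\tstart\tend\tstrand\tproteinId\tgeneName\ttranscriptName\n' > example.infoOut.txt
printf 'TX1\tGENE1\tsrc\tchr1\t100\t500\t+\tPROT1\tabc1\tabc1-201\n' >> example.infoOut.txt

# Same filter as above, applied to the example file
grep -v "^#" example.infoOut.txt | awk '{printf "%s\t%s,%s,%s,%s,%s\n", $1,$2,$3,$8,$9,$10}' > example.nameIndex.txt
cat example.nameIndex.txt
```

Note that awk splits on any whitespace by default, so an empty field in a real infoOut.txt would shift the later columns; it is worth inspecting a few lines of the output.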

# Create index for position/search function in browser
ixIxx genome.nameIndex.txt genome.nameIndex.ix genome.nameIndex.ixx
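
The files then need to be referenced from the hub's trackDb.txt. What follows is a minimal sketch of a stanza, not the exact one from the UCSC reply: it assumes genome.bb, genome.nameIndex.ix and genome.nameIndex.ixx sit in the same directory as trackDb.txt, and the track name and labels are placeholders.

```shell
# Write a trackDb stanza for the gene track; the track name and labels are placeholders
cat > trackDb.txt <<'EOF'
track genes
shortLabel Genes
longLabel Gene annotation converted from GTF
type bigBed 12
bigDataUrl genome.bb
visibility pack
searchIndex name
searchTrix genome.nameIndex.ix
EOF
```

Here searchIndex name refers to the index built with -extraIndex=name, and searchTrix points at the .ix file produced by ixIxx.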