I am trying to get RefSeq annotation (txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds, etc.) from NCBI in the format that can be downloaded from the UCSC Table Browser. The UCSC data is a table (12-column BED) that is easy to parse and manipulate. The NCBI data is in GFF3, a format I am not familiar with, and I do not know how to extract the annotation from it.
Galaxy Browser can convert GFF3 to a 12-column BED file, shared here: Galaxy_refSeq_GFF3_to_BED. The 12-column BED output from Galaxy Browser is similar to the UCSC format, but the data in some columns are not clear.
The columns in the GFF3 -> BED output are:
- chromosome name (NC_00000) - similar to UCSC hg19.refGene.chrom (can be easily converted to "chr" format)
- Start - same as UCSC hg19.refGene.txStart
- End - same as UCSC hg19.refGene.txEnd
- NM number with version - similar to UCSC hg19.refGene.name, which has no version number
- Score - similar to UCSC hg19.refGene.score
- Strand - same as UCSC hg19.refGene.strand
- Start - not same as UCSC hg19.refGene.cdsStart (but same as values in GFF3 column 2)
- End - not same as hg19.refGene.cdsEnd (but same as values in GFF3 column 3)
- Unknown column
- No. of exons - similar to UCSC hg19.refGene.exonCount (but seems to be UCSC hg19.refGene.exonCount + 1)
- Exon sizes in bp - the size of each exon matches that in UCSC, but there is one extra number at the front
- Unknown set of numbers - there are UCSC hg19.refGene.exonCount + 1 of them, i.e., if a gene has 79 exons, there are 80 numbers in this column (as in column 11).
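For reference, here is a minimal sketch of how columns 10-12 would be read if the Galaxy output follows the standard BED12 layout (blockCount, blockSizes, blockStarts, with blockStarts given relative to column 2). Whether this file actually follows that layout is an assumption. The function name `bed12_to_exons` and the example row are made up for illustration and are not real annotation.

```python
def bed12_to_exons(line):
    """Return (exon_starts, exon_ends) in genomic coordinates for one BED12 line.

    Assumes the standard BED12 meaning of columns 10-12:
    blockCount, blockSizes, blockStarts (relative to column 2).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, got {len(fields)}")

    chrom_start = int(fields[1])
    block_count = int(fields[9])
    # Columns 11 and 12 are comma-separated and may end with a trailing comma.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]

    if not (len(block_sizes) == len(block_starts) == block_count):
        raise ValueError("block count does not match the size/start lists")

    # Genomic exon start = column 2 + relative block start.
    exon_starts = [chrom_start + s for s in block_starts]
    # Genomic exon end = exon start + block size (half-open interval).
    exon_ends = [start + size for start, size in zip(exon_starts, block_sizes)]
    return exon_starts, exon_ends


# Hypothetical 3-exon row, for illustration only.
row = "chr1\t1000\t5000\tNM_000000.1\t0\t+\t1200\t4800\t0\t3\t100,200,300,\t0,1500,3700,"
starts, ends = bed12_to_exons(row)
print(starts)  # [1000, 2500, 4700]
print(ends)    # [1100, 2700, 5000]
```

Under that reading, the exonStarts and exonEnds lists should be directly comparable to the UCSC values once the chromosome names are matched. The sketch does not explain the extra leading value in columns 11 and 12.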
My questions are about the following columns:
7. Is cdsStart available in GFF3?
8. Is cdsEnd available in GFF3?
9. What is coded here?
10. Why are there always hg19.refGene.exonCount + 1 numbers? For example, DMD, which has 79 exons, shows up with 80 in the GFF3 -> BED output.
11. The exon sizes match those in UCSC, except for a large number at the front. What is this first number?
12. The numbers in this column decrease, and it is not clear what they are. If we know the size of each exon (column 11), then the only items needed to define the exon structure are the start positions, but I can't make sense of the numbers I am seeing in column 12.
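On questions 7 and 8, one possible route is to skip the BED conversion and read the GFF3 directly. GFF3 has nine tab-separated columns (seqid, source, type, start, end, score, strand, phase, attributes), with 1-based inclusive coordinates, and child features point to their transcript through a `Parent` attribute. The sketch below assumes the NCBI file has `exon` and `CDS` rows whose `Parent` is the transcript ID. The names `parse_attributes` and `transcript_table` and the example lines are hypothetical.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'key=value;key=value' (GFF3 column 9) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key] = value
    return attrs


def transcript_table(gff3_lines):
    """Collect exon and CDS intervals per Parent ID, refGene-style (0-based starts)."""
    exons = defaultdict(list)
    cds = defaultdict(list)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        f = line.rstrip("\n").split("\t")
        if len(f) != 9 or f[2] not in ("exon", "CDS"):
            continue
        # GFF3 is 1-based inclusive; UCSC tables use 0-based starts.
        start, end = int(f[3]) - 1, int(f[4])
        parents = parse_attributes(f[8]).get("Parent", "")
        target = exons if f[2] == "exon" else cds
        for parent in parents.split(","):
            if parent:
                target[parent].append((start, end))

    table = {}
    for parent, intervals in exons.items():
        intervals.sort()
        record = {
            "txStart": intervals[0][0],
            "txEnd": max(e for _, e in intervals),
            "exonCount": len(intervals),
            "exonStarts": [s for s, _ in intervals],
            "exonEnds": [e for _, e in intervals],
        }
        if parent in cds:  # non-coding transcripts have no CDS rows
            record["cdsStart"] = min(s for s, _ in cds[parent])
            record["cdsEnd"] = max(e for _, e in cds[parent])
        table[parent] = record
    return table


# Made-up GFF3 lines, for illustration only.
example = [
    "##gff-version 3",
    "chrTest\tdemo\tmRNA\t1001\t2700\t.\t+\t.\tID=rna1",
    "chrTest\tdemo\texon\t1001\t1100\t.\t+\t.\tID=e1;Parent=rna1",
    "chrTest\tdemo\texon\t2501\t2700\t.\t+\t.\tID=e2;Parent=rna1",
    "chrTest\tdemo\tCDS\t1051\t1100\t.\t+\t0\tID=c1;Parent=rna1",
    "chrTest\tdemo\tCDS\t2501\t2600\t.\t+\t1\tID=c1;Parent=rna1",
]
result = transcript_table(example)["rna1"]
print(result["cdsStart"], result["cdsEnd"])  # 1050 2600
```

The sketch ignores URL-escaped characters in attribute values and does not map the `Parent` ID back to the NM accession. Depending on the file, that would need a lookup through the transcript row's own attributes.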
Thanks in advance for any help.
@t_pod I am having the same issue with my GFF3 downloaded from NCBI. I converted it to BED12 using the Galaxy GFF-to-BED converter, then removed duplicate rows and all rows that did not have 12 columns. Now I have rows with n+1 items in column 11. Were you able to resolve this? If so, how did you fix it?
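For anyone repeating the cleanup step described above, it can be sketched roughly as follows. This only drops rows without 12 tab-separated columns and exact duplicate rows. It does nothing about the n+1 items in column 11. The function name `clean_bed12` and the sample rows are made up.

```python
def clean_bed12(lines):
    """Keep the first copy of each row that has exactly 12 tab-separated columns."""
    seen = set()
    kept = []
    for line in lines:
        row = line.rstrip("\n")
        if len(row.split("\t")) != 12:
            continue  # not a BED12 row
        if row in seen:
            continue  # exact duplicate of an earlier row
        seen.add(row)
        kept.append(row)
    return kept


# Hypothetical input: one good row, its duplicate, and a short row.
good = "\t".join(["chr1", "0", "10", "x", "0", "+", "0", "10", "0", "1", "10,", "0,"])
rows = [good + "\n", good + "\n", "chr1\t0\t10\n"]
cleaned = clean_bed12(rows)
print(len(cleaned))  # 1
```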