GFF3 to 12-column BED
8.2 years ago
New2R ▴ 60

I am trying to get RefSeq annotation (txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds, etc.) from NCBI in the same format that can be downloaded from the UCSC Table Browser. The UCSC data is a table (12-column BED) that is easy to parse and manipulate. The NCBI data is in GFF3, a format I am not familiar with, and I do not know how to extract the annotation from it.

Galaxy is able to convert GFF3 to a 12-column BED file, shared here: Galaxy_refSeq_GFF3_to_BED. The 12-column BED output from Galaxy is similar to the UCSC format, but the data in some columns are not clear.

The columns in the GFF3 -> BED output are:

  1. Chromosome name (NC_00000) - similar to UCSC hg19.refGene.chrom (can be easily converted to "chr" format)
  2. Start - same as UCSC hg19.refGene.txStart
  3. End - same as UCSC hg19.refGene.txEnd
  4. NM number with version - similar to UCSC hg19.refGene.name, which has no version number
  5. Score - similar to UCSC hg19.refGene.score
  6. Strand - same as UCSC hg19.refGene.strand
  7. Start - not same as UCSC hg19.refGene.cdsStart (but same as values in GFF3 column 2)
  8. End - not same as hg19.refGene.cdsEnd (but same as values in GFF3 column 3)
  9. Unknown column
  10. Number of exons - similar to UCSC hg19.refGene.exonCount (but seems to be UCSC hg19.refGene.exonCount + 1)
  11. Exon sizes in bp - the size of each exon matches UCSC, but there is one extra number at the front
  12. Unknown set of numbers - contains UCSC hg19.refGene.exonCount + 1 numbers, i.e., if a gene has 79 exons, there are 80 numbers in this column (as in column 11).
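
For reference, the column names in the UCSC BED format description can be attached to a converted line, which makes it easier to compare against the refGene table. The snippet below is a minimal sketch: `parse_bed12` and `refseq_to_ucsc_chrom` are hypothetical helper names, and the chromosome mapping assumes human nuclear chromosomes only (NC_000001 to NC_000022 for the autosomes, NC_000023 for X, NC_000024 for Y). Everything else is returned unchanged.

```python
import re

# Field names of the 12 BED columns, as listed in the UCSC BED format description.
BED12_FIELDS = [
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "thickStart", "thickEnd", "itemRgb", "blockCount", "blockSizes", "blockStarts",
]


def parse_bed12(line):
    """Split one tab-separated BED12 line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != 12:
        raise ValueError(f"expected 12 columns, got {len(values)}")
    rec = dict(zip(BED12_FIELDS, values))
    for key in ("chromStart", "chromEnd", "thickStart", "thickEnd", "blockCount"):
        rec[key] = int(rec[key])
    # blockSizes and blockStarts are comma-separated lists; a trailing comma is allowed.
    for key in ("blockSizes", "blockStarts"):
        rec[key] = [int(x) for x in rec[key].split(",") if x]
    return rec


def refseq_to_ucsc_chrom(accession):
    """Map a human RefSeq chromosome accession such as NC_000023.10 to 'chrX'.

    Assumption: human nuclear chromosomes only; any other accession
    (scaffolds, mitochondrion, other species) is returned unchanged.
    """
    match = re.match(r"NC_(\d+)\.\d+$", accession)
    if not match:
        return accession
    number = int(match.group(1))
    if 1 <= number <= 22:
        return f"chr{number}"
    if number == 23:
        return "chrX"
    if number == 24:
        return "chrY"
    return accession
```

In BED terms, columns 7 and 8 are the thick (coding) region, column 9 is a display colour, and columns 10 to 12 describe the blocks (exons), with block starts given relative to column 2.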

My questions are about the following columns:

7. Is cdsStart available in GFF3?
8. Is cdsEnd available in GFF3?
9. What is coded here?
10. Why are there always hg19.refGene.exonCount + 1 exons? E.g., DMD, which has 79 exons, shows up with 80 in the GFF3-derived BED.
11. The exon sizes match those in UCSC, except for an extra large number at the front. What is this first number?
12. The numbers here decrease along the list, and it is not clear what they are. If we know the size of each exon (column 11), then the only items needed to define the exon structure are the start positions, but I can't make sense of the numbers I am seeing in column 12.

Thanks in advance for any help.
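
One way to explore the questions about columns 7 to 12 is to build the BED12 line directly from the GFF3 features. The sketch below is not what the Galaxy converter does. It is a minimal illustration under these assumptions: transcripts are features of type `mRNA` with an `ID` attribute, exon and CDS features point at them through `Parent`, and the `Name` attribute (if present) is used as the BED name. The function names are hypothetical. GFF3 coordinates are 1-based and inclusive, while BED is 0-based and half-open, so only the start is shifted by one.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the GFF3 attributes column ("key=value;key=value") into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def gff3_to_bed12(lines, transcript_types=("mRNA",)):
    """Return one BED12 line per transcript found in an iterable of GFF3 lines."""
    transcripts = {}            # transcript ID -> (seqid, start, end, strand, name)
    exons = defaultdict(list)   # transcript ID -> [(start, end), ...], 1-based inclusive
    cds = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        seqid, _source, ftype, start, end, _score, strand, _phase, attr_text = cols
        attrs = parse_attributes(attr_text)
        start, end = int(start), int(end)
        if ftype in transcript_types and "ID" in attrs:
            name = attrs.get("Name", attrs["ID"])  # assumption: Name holds the accession
            transcripts[attrs["ID"]] = (seqid, start, end, strand, name)
        elif ftype in ("exon", "CDS"):
            target = exons if ftype == "exon" else cds
            for parent in attrs.get("Parent", "").split(","):
                if parent:
                    target[parent].append((start, end))

    rows = []
    for tid, (seqid, start, end, strand, name) in transcripts.items():
        # Blocks are listed in ascending genomic order on both strands.
        blocks = sorted(exons.get(tid, [])) or [(start, end)]
        chrom_start = blocks[0][0] - 1
        chrom_end = max(e for _, e in blocks)
        if tid in cds:
            thick_start = min(s for s, _ in cds[tid]) - 1
            thick_end = max(e for _, e in cds[tid])
        else:
            # Convention chosen here for non-coding transcripts: empty thick region.
            thick_start = thick_end = chrom_start
        sizes = ",".join(str(e - s + 1) for s, e in blocks) + ","
        starts = ",".join(str(s - 1 - chrom_start) for s, _ in blocks) + ","
        fields = [seqid, chrom_start, chrom_end, name, 0, strand,
                  thick_start, thick_end, 0, len(blocks), sizes, starts]
        rows.append("\t".join(str(f) for f in fields))
    return rows
```

With this layout, cdsStart and cdsEnd come from the outermost CDS features of a transcript, and column 12 is simply each exon start minus the transcript start, so the first entry is always 0 and the number of entries in columns 11 and 12 equals column 10.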

GFF3 BED • 4.4k views
7.4 years ago
t_pod ▴ 30

Hello, I have a similar issue. As the annotation file was not available on UCSC, I downloaded the GFF3 file from NCBI and converted it to a BED12 file via Galaxy.

However, I am finding many discrepancies in my converted BED12 file when I compare it to a "correct" BED12 file from UCSC:

The 5th and 9th columns are always "0" (that might not be an issue),

in total, 50% of lines have only the first 6 columns filled and are just named "CDS" in the 4th column without further specification (should I discard all of them?),

and sometimes column 11 displays n+1 items, where the additional item is "000" and n = block count (= column 10). I am expecting columns 11 and 12 to have the same number of items.
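
The discrepancies listed above can be detected line by line. The following is a rough sketch, with `bed12_problems` as a hypothetical helper name. It checks the column count, the agreement between column 10 and the list lengths in columns 11 and 12, and the two block rules from the UCSC BED format description (the first block start is 0, and the last block ends at chromEnd). It only reports problems and does not attempt to repair them.

```python
def bed12_problems(line):
    """Return a list of human-readable problems found in one BED line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 12:
        return [f"{len(cols)} columns instead of 12"]
    try:
        chrom_start, chrom_end = int(cols[1]), int(cols[2])
        block_count = int(cols[9])
        # Empty items (e.g. from a trailing comma) are ignored.
        sizes = [int(x) for x in cols[10].split(",") if x]
        starts = [int(x) for x in cols[11].split(",") if x]
    except ValueError:
        return ["non-integer value in a numeric column"]

    problems = []
    if len(sizes) != block_count:
        problems.append(f"blockSizes has {len(sizes)} items, blockCount is {block_count}")
    if len(starts) != block_count:
        problems.append(f"blockStarts has {len(starts)} items, blockCount is {block_count}")
    # The geometry checks only make sense when the two lists line up.
    if sizes and len(sizes) == len(starts):
        if starts[0] != 0:
            problems.append("first blockStart is not 0")
        if chrom_start + starts[-1] + sizes[-1] != chrom_end:
            problems.append("last block does not end at chromEnd")
    return problems
```

An empty list means the line passed these checks; a 6-column "CDS" line is reported as having the wrong column count, and a line with an extra "000" item is reported as a blockSizes mismatch.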

Any advice?

Thanks


@t_pod I am having the same issue with my GFF3 downloaded from NCBI. I converted it to BED12 using the Galaxy GFF-to-BED converter. I removed all rows that did not have 12 columns and duplicate rows. Now I have rows with n+1 items in column 11. Were you able to resolve this? If so, how did you fix it?
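
The cleanup described in this comment (drop rows without 12 columns, drop duplicates) can be sketched as a single pass that also sets aside the rows whose list lengths disagree with column 10, so they can be inspected rather than silently kept. `split_bed12_rows` is a hypothetical name, and treating identical lines as duplicates is an assumption about what "duplicate rows" means here.

```python
def split_bed12_rows(lines):
    """Drop short and duplicate rows, then split the rest into (consistent, suspect)."""
    seen = set()
    consistent, suspect = [], []
    for line in lines:
        row = line.rstrip("\n")
        cols = row.split("\t")
        # Skip rows that are not 12 columns wide or that were already seen verbatim.
        if len(cols) != 12 or row in seen:
            continue
        seen.add(row)
        n_sizes = len([x for x in cols[10].split(",") if x])
        n_starts = len([x for x in cols[11].split(",") if x])
        if cols[9].isdigit() and n_sizes == n_starts == int(cols[9]):
            consistent.append(row)
        else:
            suspect.append(row)
    return consistent, suspect
```

The order of the input is preserved in both output lists, so the suspect rows can be traced back to the corresponding GFF3 records.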

9 months ago
alejandrogzi ▴ 140

There is this new tool called gxf2bed that outperforms most of the current tools. See the tool post here


Please stop posting answers about your tool unless it exactly answers that particular question. If it does, add some text answering their question using your tool.
