I have a gff file and saw a portion of it that looks like this:
DS239414 GenBank gene 3787 6375
DS239414 GenBank mRNA 3787 6375
DS239414 GenBank mRNA 3787 6375
DS239414 GenBank CDS 4036 4200
DS239414 GenBank CDS 4379 4561
DS239414 GenBank CDS 4645 4815
DS239414 GenBank CDS 4963 5129
DS239414 GenBank CDS 5611 5695
DS239414 GenBank CDS 5951 6050
DS239414 GenBank CDS 6187 6215
DS239414 GenBank exon 3787 4200
DS239414 GenBank exon 4379 4561
DS239414 GenBank exon 4645 4815
DS239414 GenBank exon 4963 5129
DS239414 GenBank exon 5611 5695
DS239414 GenBank exon 5951 6050
DS239414 GenBank exon 6187 6375
DS239414 GenBank CDS 4036 4200
DS239414 GenBank CDS 4379 4561
DS239414 GenBank CDS 4645 4815
DS239414 GenBank CDS 4963 5215
DS239414 GenBank CDS 5577 5695
DS239414 GenBank CDS 5951 6050
DS239414 GenBank CDS 6187 6215
DS239414 GenBank exon 4963 5215
DS239414 GenBank exon 5577 5695
The first thing I understand is that there are two different transcripts for this gene (because there are two mRNA lines). What looks odd, however, is that in the second transcript the first 3 CDSs appear to lie in introns (judging by where that transcript's first exon starts).
Am I right? Is this possible? Or is it something else that I don't understand in the gff format?
Where did you get the GFF?
It is incomplete by any GFF standard; all of them require 9 columns.
It looks like the first 5 columns of output from bp_genbank2gff3.pl. Try the conversion yourself at http://www.hiv.lanl.gov/content/sequence/FORMAT_CONVERSION/form.html and you will get 'only differing exons'.
NCBI recently overhauled their GFF3 conversion software but the bits are not available yet, as noted here: https://groups.google.com/forum/?fromgroups#!topic/bioperl-l/TYbSSKNQZQM
http://www.ebi.ac.uk/cgi-bin/readseq.cgi gives gff2, which might serve your purpose better, depending...
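For what it's worth, the nine-column requirement is easy to check on a line-by-line basis. The snippet below is only a rough sketch: the helper name `column_count` is made up for illustration, and the score, strand, phase and attribute values in the `full` line are placeholders, not values taken from the real file.

```python
def column_count(line):
    """Count tab-separated fields in one GFF feature line."""
    return len(line.rstrip("\n").split("\t"))

# Hypothetical complete line: columns 6-9 are placeholder values.
full = "DS239414\tGenBank\tgene\t3787\t6375\t.\t+\t.\tID=gene0"
# The same feature with only the first five columns kept, as in the question.
cut = "DS239414\tGenBank\tgene\t3787\t6375"

for name, line in (("full", full), ("cut", cut)):
    n = column_count(line)
    status = "ok" if n == 9 else "incomplete"
    print(name, n, status)
```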
Casey, You're right that it's a conversion error! I looked at the corresponding GenBank file and it looks like the GFF file reports only the differing exons for the second transcript. This is the reason why 5 CDSs (and not 3 as I mistakenly wrote in my question) appear to be inside introns.
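In case it helps anyone else, here is a minimal sketch of how the count of 5 can be checked. It takes the coordinates listed for the second transcript in the question and reports every CDS that is not contained in one of that transcript's listed exons. The function name `cds_outside_exons` is just something made up for this example, and the coordinates are assumed to be 1-based and inclusive, as in GFF.

```python
# Second transcript as listed in the question: (start, end) pairs.
cds = [(4036, 4200), (4379, 4561), (4645, 4815), (4963, 5215),
       (5577, 5695), (5951, 6050), (6187, 6215)]
exons = [(4963, 5215), (5577, 5695)]

def cds_outside_exons(cds, exons):
    """Return the CDS intervals that no listed exon fully contains."""
    return [
        (c_start, c_end)
        for c_start, c_end in cds
        if not any(e_start <= c_start and c_end <= e_end
                   for e_start, e_end in exons)
    ]

orphans = cds_outside_exons(cds, exons)
print(len(orphans))  # 5
for start, end in orphans:
    print(start, end)
```

The five intervals it prints are exactly the CDSs whose exons are shared with the first transcript and were therefore not repeated in the converted file.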
I used bp_genbank2gff3.pl to create it from the corresponding GenBank file that I downloaded from NCBI. And yes, I used "cut" to exclude all "irrelevant" fields; I only wanted to ask you guys what was wrong with the second splice variant!