Although this question was asked a while ago, I thought it was worth posting an update. Both GFF3 files are based on experimental evidence, but they were produced by different routes. The details are summarized below.
Spliced_Junctions_Clustered.gff
We used high-throughput RNA sequencing data (RNA-seq data) from the Ecker and Mockler labs, and used alignment tools called Tophat and Supersplat to align these sequences to the Arabidopsis genome, resulting in 203,000 clustered spliced RNA-seq junctions.
Cited from ref1
TAIR10_gff3_genes.gff
This annotation drew on RNA-seq and proteomic datasets, gene models provided by NCBI, and manually curated gene models from Swiss-Prot. The pipeline involved multiple steps, including mapping, assembly, and gene model construction (ref2).
In summary, one could see Spliced_Junctions_Clustered.gff as roughly the direct output of TopHat (likely junctions.bed) for the RNA-seq data sets. TAIR10_gff3_genes.gff, in contrast, used the datasets described above to update the gene models. There are more exon records in TAIR10_gff3_genes.gff because it includes the junctions for all annotated transcripts, many of which may not have been detected in the tissue used to generate Spliced_Junctions_Clustered.gff. Conversely, some junctions should be unique to Spliced_Junctions_Clustered.gff, because they may not have made it through the TAIR10_gff3_genes.gff pipeline, i.e. they did not assemble into a transcript.
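If you want to check this overlap yourself, here is a minimal Python sketch of one way to do it: derive introns from the exon records of each transcript, then compare them with the junction records. It rests on assumptions I have not verified against the actual files: that exons carry a Parent attribute, that each junction record spans the intron itself, and that the junction feature type is called "junction" (a placeholder name, so check column 3 of your file). The toy records are made up for illustration only.

```python
from collections import defaultdict


def parse_gff3(lines):
    """Yield (seqid, type, start, end, strand, attrs) for each 9-column GFF3 line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        yield cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6], attrs


def introns_from_exons(records):
    """Return the set of introns (1-based, inclusive) implied by exon records."""
    exons = defaultdict(list)
    for seqid, ftype, start, end, strand, attrs in records:
        if ftype != "exon":
            continue
        # Assumption: exons are linked to transcripts via the Parent attribute.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[(seqid, strand, parent)].append((start, end))
    introns = set()
    for (seqid, strand, _parent), spans in exons.items():
        spans.sort()
        for (_, left_end), (right_start, _) in zip(spans, spans[1:]):
            if right_start - left_end > 1:
                introns.add((seqid, left_end + 1, right_start - 1, strand))
    return introns


def junction_intervals(records, feature_type="junction"):
    """Collect junction records; feature_type is a placeholder, not the real name."""
    return {
        (seqid, start, end, strand)
        for seqid, ftype, start, end, strand, _ in records
        if ftype == feature_type
    }


# Made-up toy records standing in for the two files.
annotation_lines = [
    "Chr1\tTAIR10\texon\t100\t200\t.\t+\t.\tParent=T1",
    "Chr1\tTAIR10\texon\t301\t400\t.\t+\t.\tParent=T1",
    "Chr1\tTAIR10\texon\t501\t600\t.\t+\t.\tParent=T1",
]
junction_lines = [
    "Chr1\tRNAseq\tjunction\t201\t300\t.\t+\t.\tID=J1",
    "Chr1\tRNAseq\tjunction\t701\t800\t.\t+\t.\tID=J2",
]

annotated = introns_from_exons(parse_gff3(annotation_lines))
observed = junction_intervals(parse_gff3(junction_lines))

shared = annotated & observed           # in both files
annotation_only = annotated - observed  # annotated but not seen in this RNA-seq set
junction_only = observed - annotated    # seen in RNA-seq but not in any gene model

print(len(shared), len(annotation_only), len(junction_only))
```

For the real files, pass open file handles to parse_gff3 instead of the toy lists. If the junction records turn out to encode flanking exon blocks rather than the intron itself, the coordinates would need converting before the comparison.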
Thanks a lot for your comments!