Hello,
Is there a way to detect discrepancies in GFF files such as:
1- A gene has 3 transcripts, 1 is oriented on the forward strand, 2 are oriented on the reverse strand. I tried with Gffread, but it just affects the value of the first transcript met.
2- A transcript that's on chromosome #2 is associated to a gene that's on chromosome #1.
I've seen these situations on an official annotation from UCSC, so I wonder if there's a way to automatically detect and correct them. I've seen them using mikado compare, because it crashes when finding these situations. Maybe AGAT could handle that (I didn't find but I think it's worth asking) ?
Thanks for the help !
EDIT: AGAT actually detects discrepancies in agat_sp_compare_two_annotations.pl
. It outputs "We fixed 1 wrong level2 location cases" for the first case (transcripts oriented both ways). But can we output the corrected files ?
agat_sp_compare_two_annotations.pl
is to detect fusion/split of genes between two annotation builds. All scripts with the_sp_
prefix pass through the the GFF AGAT parser (the others with_sq_
use the Bioperl parser). It is why you see fixes when usingagat_sp_compare_two_annotations.pl
. But if your goal is just to check the sanity of your file, as mentioned by @Michael Dondrup,agat_convert_sp_gxf2gxf.pl
is the script to use (you will get a log file resuming the problem detected by AGAT and modifications made)."We fixed 1 wrong level2 location cases" means the mRNA boundaries is wrong according to the sub features attached to it. (e.g one exon is over the mRNA boundary. So AGAT extends the mRNA to include the exon.)
I don't think cases described in 1) might be detected (btw never seen that before... if you could share one example it would be interesting).
For the case2, if the Parent/ID relationship is properly defined, or if they collectively share the same locus_tag/gene_id attribute, AGAT will not detect such case neither.
The solution would be to remove all Parent/ID relationship and locus_tag/gene_id attributes, and let AGAT reshape everything (using --merge_loci option to merge mRNA with overlap in their CDS under a uniq gene umbrella.). But this solution will wipe real cases of overlapping genes if any (Overlapping CDS of different genes is quite rare in eukaryote). If you don't use the --merge_loci option, each isoform will have its own gene feature as parent.
You should use GFF3toolkit it is much appropriate. I guess your errors are labelled by these error codes
Ema0007
andEma0009
.Thanks a lot for your replies. I will try with GFF3toolkit and agat_convert_sp_gxf2gxf.pl then.
Here is how to reproduce the errors:
Case1 also happens with galGal6.refGene.gtf.gz - UCSC downloads but not case2 - there might be other examples though. I think the main issue comes from the geneID being identical to geneName.
Which genome are we talking about? These deviations appear quite grave. In case these are errors, I would prefere to have them fixed upstream because even though we cannot think of any reason right now, there might well be one (fusion transcripts, trans-splicing, aligned experimental transcript sequence, circular transcripts, etc.).
If you need a quick fix, it looks like the AGAT script
agat_convert_sp_gxf2gxf.pl
could be the script that outputs a 'corrected' version. However, I would double-check if the output conforms to your expectations. I haven't used AGAT before, btw, but it looks very promising.We're talking about chicken genome annotation from UCSC with NCBI RefSeq annotation. To get it:
wget https://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/genes/galGal6.ncbiRefSeq.gtf.gz
. A simple grep of CCL4 (case #1 about transcripts orientations) or RAP1GAP2 (case #2 about chromosomes number) will show you the discrepancies. It's even clearer when converting GTF to GFF3 first.It looks like there are two copies of CCL4 gene oriented towards (and near) each other. Go this link and then search with
CCL4
insearch assembly
box.There is only one location for
RAP1GAP2
onchr19
. So I am not sure where the other annotation is coming fromYes, there are 2 close by CCL4 genes, but does it really make sense ? I guess that using a better geneID instead of geneName would prevent (some of) these errors. Do you know of a tool to directly integrate geneID from Ensembl to these GTF annotations files ?
CCL4 copies are not overlapping. Since Chicken genome has been through a few builds the sequence in that region must be stable, if the two copies are being annotated as they are. Ensembl also has the two copies.
Annotation for
RAPGAP2
onchr21
is based on NCBI's automatedgnomon
eukaryotic annotation pipeline. It may not have strong experimental support. So take that into account. Not sure why UCSC has included it in theirRefSeq
category.