Hi folks,
I am preparing a flat file of a fungal genome for submission to EMBL. I used EMBL-API_validator-1.1.263 to check the flat file and got warnings of the form: WARNING: "exon" usually expected to be at least "15" nt long. Please check the accuracy. Could you please advise how to fix this? Or should I just ignore it, since it is only a warning?
Thank you very much in advance!
Do you have predicted exons that are < 15 nt in your file? They don't make a lot of sense at that length. How did you do the annotation? How was the genome assembled?
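A quick way to answer the first question is to scan the annotation for exon features shorter than the validator's threshold. This is only a minimal sketch, assuming the gene models are also available as GFF3 (nine tab-separated columns, 1-based inclusive coordinates); the function name find_short_exons and the example records are made up for illustration.

```python
import io

MIN_EXON_NT = 15  # threshold quoted in the validator warning


def find_short_exons(gff_handle, min_len=MIN_EXON_NT):
    """Yield (seqid, start, end, length, attributes) for exons shorter than min_len."""
    for line in gff_handle:
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        start, end = int(cols[3]), int(cols[4])
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if length < min_len:
            yield cols[0], start, end, length, cols[8]


# Hypothetical records, used only to show the call.
example = io.StringIO(
    "##gff-version 3\n"
    "contig1\tmaker\texon\t100\t250\t.\t+\t.\tID=e1;Parent=mRNA1\n"
    "contig1\tmaker\texon\t300\t305\t.\t+\t.\tID=e2;Parent=mRNA1\n"
)
short_exons = list(find_short_exons(example))
for hit in short_exons:
    print(*hit, sep="\t")
```

With a real file, the StringIO object would be replaced by an open file handle. The Parent attribute in the last column tells you which transcript each short exon belongs to.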
Hmm, it seems so. One of the three exons (the last one) is only 6 bases long, although the predicted protein itself is 63 aa. Do you have a suggestion for how I should treat this exon and the corresponding protein? It is a draft genome, assembled with SPAdes 3.10 and annotated using FunGAP (https://github.com/CompSynBioLab-KoreaUniv/FunGAP). For this protein, however, FunGAP inherited the model from MAKER.
In the case of this specific protein, what is it most similar to when you do a blastp search? You may want/need to adjust the annotations (sequences) based on the homology you see.
I ran blastp and the only match is a response regulator (1118 aa) from the bacterium Maricaulis salignorans. I should have run blastp, not only InterProScan. Perhaps it's better to remove this predicted protein?
Is the blastp match consistent over the majority of the protein sequence, excluding this bit at the end? Perhaps that part of the prediction is incorrect.
The identity between the two proteins is only 36% and they are very different in size (63 aa vs 1118 aa). Actually, this is not the only model with the problem. I took a few more for blastp and they seem correct, with nice alignments to other proteins, but they still contain a tiny tail of 2-3 aa at the end, which triggers the warning. I guess it has something to do with the genome, as it is only at the draft stage. As for the weird one, perhaps I can label it as a hypothetical protein and leave it in the annotation?
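To see how many models carry such a tiny tail, the exons can be grouped by transcript and the last exon picked with strand taken into account. This is a rough sketch, assuming a GFF3 file in which exon features carry a Parent attribute; the function name short_terminal_exons and the example records are hypothetical. It looks at exon features only, so it does not separate coding sequence from UTR.

```python
from collections import defaultdict


def short_terminal_exons(gff_lines, min_len=15):
    """Return {transcript_id: length} for multi-exon models whose last exon is < min_len nt."""
    exons = defaultdict(list)  # transcript ID -> [(start, end, strand), ...]
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4]), cols[6]))

    flagged = {}
    for parent, items in exons.items():
        if len(items) < 2:
            continue  # single-exon models have no separate tail exon
        items.sort()  # order by start coordinate
        strand = items[0][2]
        # On the minus strand the 3' end is the exon with the lowest coordinates.
        last = items[0] if strand == "-" else items[-1]
        length = last[1] - last[0] + 1
        if length < min_len:
            flagged[parent] = length
    return flagged


# Hypothetical records, used only to show the call.
example = """\
c1\tmaker\texon\t100\t250\t.\t+\t.\tID=a1;Parent=mRNA1
c1\tmaker\texon\t300\t400\t.\t+\t.\tID=a2;Parent=mRNA1
c1\tmaker\texon\t450\t455\t.\t+\t.\tID=a3;Parent=mRNA1
c1\tmaker\texon\t1000\t1008\t.\t-\t.\tID=b1;Parent=mRNA2
c1\tmaker\texon\t1100\t1300\t.\t-\t.\tID=b2;Parent=mRNA2
c1\tmaker\texon\t2000\t2005\t.\t+\t.\tID=c1;Parent=mRNA3
c1\tmaker\texon\t2100\t2300\t.\t+\t.\tID=c2;Parent=mRNA3
""".splitlines()

flagged = short_terminal_exons(example)
print(flagged)
```

Here mRNA3 has a short first exon rather than a short last one, so it is not reported. The flagged transcripts are the ones worth inspecting by hand against their blastp alignments.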
So you have not done any due diligence to check/correct for these errors? Obviously the software is wrong in this case.
The pipeline runs blastp/InterProScan/blastn/BUSCO before choosing the best models, so I thought that was sufficient. Also, I don't really know how to check the models. Could you suggest how?
To confirm, is this the kind of data you are processing?
Did you pre-filter your SPAdes contigs to eliminate small/redundant pieces? This pipeline must be taking all sequences you provided to it when doing predictions. Did you look at the logs to see what the BUSCO results were suggesting (in terms of completeness of sequence)?
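For the length part of such a pre-filter, something like the following would do. It is a minimal sketch using only the standard library; the function names read_fasta and filter_contigs, the 500 bp default and the example sequences are placeholders rather than a recommendation, and it does nothing about redundancy, which needs an alignment-based step.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(handle, min_len=500):
    """Keep contigs of at least min_len bases (min_len is an arbitrary placeholder)."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) >= min_len]


# Hypothetical contigs, used only to show the call.
example = io.StringIO(
    ">contig_A\n" + "ACGT" * 150 + "\n"
    ">contig_B\n" + "ACGT" * 30 + "\n"
)
kept = filter_contigs(example)
for header, seq in kept:
    print(header, len(seq))
```

The lengths are computed from the sequences themselves rather than parsed from the contig headers, so the same code works whatever the assembler's naming scheme is.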
Correct annotation is hard. The pipeline did what it was programmed to do, but the results have to be vetted. Does the genome you are submitting have close relatives available? Were those considered in annotation comparison?
Yes, I used nucmer to remove repetitive contigs <500 bp before feeding the genome to FunGAP, but I did not look at the BUSCO results, since I thought FunGAP would sieve out the small and highly fragmented models.
There is a genome of a close relative available. I think the plan is: I will take a look at the BUSCO results, search for the models with short exons, and remove those with low completeness if they are not present in the genome of the closely related species. :) Please suggest any additional steps I should consider.