Here I show how to filter an annotation (often evidence-based) to create a golden set of genes in order to train ab-initio annotation tools.
(What is described here can be automatically done using this NBIS Nextflow pipeline)
Prerequisite:
- an annotation in gff or gtf
- AGAT (Another Gff Analysis Toolkit)
- blast
- to make it easier a nice folder structure:
- mkdir filter
- mkdir protein
- mkdir nonredundant
- mkdir blast_recursive
1. Select protein coding genes
First we select only the protein coding genes from the annotation file and remove all other type of level2 features: tRNA, snoRNA, etc.
agat_sp_separate_by_record_type.pl --gff annotation_evidence_based.gff -o splitByLevel2
ln -s splitByLevel2/mrna.gff
For older AGAT's version agat_sp_separate_by_record_type.pl
was called agat_sp_split_by_level2_feature.pl
2. Optional (Only if Maker annotation): filter by score
Next step, we need to filter the best genes. We will keep gene with AED score under a certain score (e.g 0.3).
agat_sp_filter_feature_by_attribute_value.pl --gff mrna.gff --value 0.3 -a _AED -t ">=" -o filter/codingGeneFeatures.filter.gff
3. Longest isoform per mRNA
Now we filter mRNAs to keep only the longest isoform when several are avaialble.
agat_sp_keep_longest_isoform.pl --gff filter/codingGeneFeatures.filter.gff -o filter/codingGeneFeatures.filter.longest_cds.gff
4. Complete genes
We then check that the gene models are complete (has a start and stop codon), and remove the incomplete ones.
agat_sp_filter_incomplete_gene_coding_models.pl --gff filter/codingGeneFeatures.filter.longest_cds.gff -f genome.fa -o filter/codingGeneFeatures.filter.longest_cds.complete.gff
5. Distance from neighboring genes
We can also filter the gene models by distance from neighboring genes in order to be sure to have nice intergenic region wihtout genes in it in order to train properly intergenic parameters. (default 500 bp)
agat_sp_filter_by_locus_distance.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.gff -o filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff
6. Remove redundancy
To avoid to bias the training and give an exhaustive view of the diversity of gene we have to remove the ones that are too similar to each other. In order to do so, we translate our coding genes into proteins, format the protein fasta file to be able to run a recursive blast and then filter them.
agat_sp_extract_sequences.pl --gff filtercodingGeneFeatures.filter.longest_cds.complete.good_distance.gff -f genome.fa --aa -o protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa
makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -dbtype prot
blastp -query protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -db
protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive
agat_sp_filter_by_mrnaBlastValue.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --outfile nonredundant/codingGeneFeatures.nr.gff
Voila, nonredundant/codingGeneFeatures.nr.gff
contains your nice gene set.
You need several hundred genes to train properly an ab initio tool. The more you have, the better is... but you kind of reach a plateau around 700 genes.
Hi Jacques, Thank you for sharing your protocol. I just wanted to check with you step 5. Should you not request agat_sp_extract_sequences.pl to translate using the --aa switch? Thanks in advance
Yes you're right, thank you for pointing it.
Hi, I have the below GFF3 file and I would like to remove mRNAs which do not have a description i.e. do not contain
Note
. Does AGAT remove those mRNAs? If not then do you know any other tool?Thank you in advance.
I think I have not develop any script to perform such filtering within the AGAT toolkit, because a simple
awk
command can do the trick:Thank you. However, using the above command will ignore the below sub features:
What do you mean? Those features should still be in the output.
I would like to keep those features because they mRNA contain
Note
(writing into keep.gff3 file)On the other hand, I would like to reject those features because mRNA does not contain
Note
(writing into reject.gff3 file)Ok I understand, then
awk
is not well suited. Unfortunalty I do not have script yet to perform this task.It seems to work for now after
sed 's|,AT.*-Protein;||g' TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein.gff > TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id.gff
.It seems you mixed up your answer with that post isn't it?
Thank you for noticing.
I have now a script to perform this task properly in AGAT called
agat_sp_filter_feature_by_attribute_presence.pl
Hi Thanks for your guide !
I am working to get a nice set of gene after my 1st round of maker. In that case, what is
annotation_evidence_based.gff
?It's your output of maker after using:
...../src/maker/bin/gff3_merge -d your_name_master_datastore_index.log
He called it evidence_based because your first maker run should be using protein and RNA (ESTs) evidence to detect genes. The next maker round you do will be without these evidences (you should disable them) and only with the model .hmm file that you create with SNAP or Augustus.
Cheers,
I didn't even realize this was necessary. The maker tutorials don't mention it at all, they just proceed to SNAP modelling with all genes, using only "fathom" as quality control. Does this make a very big difference in results? Where can I learn about it?
Cheers,
P.S: Thank you so much for sharing your knowledge so well!
By experience the approach I mention here give quite similar results (at least for Augustus), excepted it needs only one round of MAKER, that make the approach much more efficient in term of time.
I have seen you message on the MAKER mailing list, you can use alt_ESTs only but not with est2genome=1 because no gene prediction are made with alt_ESTs (it is not obvious but written somewhere in one of their paper). What you can do is to use alt_EST data as EST too. I often do that, it might help if it is not too diverged.But you will not have many prediction ET-based like that.
Hi Juke34,
Do you have a guide about how do you train
nonredundant/codingGeneFeatures.nr.gff
with a gene predictor like AUGUSTUS ? thanksLook at the pipeline I mention below you can find the commands needed to train Augustus from the nr file
Thanks Juke, I was wondering if this command (from your tutorial):
does the same thing than :
did you have the opportunity to test ?
I think the training set must be in Genbank format, you have to convert it first
Hi do you have a command that filter out genes < x AA ? because even after all these steps I have some genes (very minor) that are very short and I think these are false positives.
agat_sp_filter_gene_by_length.pl
from AGATHi Juke,
I am wondering whether it is possible to output codingGeneFeatures.nr.gff (from step 5) to be .fasta so that one can use it in busco or Maker or Braker?
I am aware that you will update the post here (I wrote to you on Github already about that).
I will appreciate your feedback. Thank you
Hi, Using
agat_sp_extract_sequences.pl
(Extracting genomic feature sequences from GTF/GFF files with AGAT), based on the annotation gff+genome fasta you can extract the protein in fasta format.Hi Juke,
Given that training Augustus in Busco for example requires a fasta file as input (see command below), I am wondering, how do you proceed from nonredundant/codingGeneFeatures.nr.gff (results from step) to train these gene sets in Busco?
Augustus/Busco command:
I don't get what you want to achieve, could you be more explicit? What version of Busco are you using?