Tutorial:gene set filter/selection for training ab initio annotation tools
1
9
Entering edit mode
5.3 years ago
Juke34 8.9k

Here I show how to filter an annotation (often evidence-based) to create a golden set of genes in order to train ab-initio annotation tools.

(What is described here can be automatically done using this NBIS Nextflow pipeline)

Prerequisite:

  • an annotation in gff or gtf
  • AGAT (Another Gff Analysis Toolkit)
  • blast
  • to make it easier a nice folder structure:
    • mkdir filter
    • mkdir protein
    • mkdir nonredundant
    • mkdir blast_recursive

1. Select protein coding genes

First we select only the protein coding genes from the annotation file and remove all other type of level2 features: tRNA, snoRNA, etc.

agat_sp_separate_by_record_type.pl --gff annotation_evidence_based.gff -o splitByLevel2  
ln -s splitByLevel2/mrna.gff

For older AGAT's version agat_sp_separate_by_record_type.pl was called agat_sp_split_by_level2_feature.pl

2. Optional (Only if Maker annotation): filter by score

Next step, we need to filter the best genes. We will keep gene with AED score under a certain score (e.g 0.3).

agat_sp_filter_feature_by_attribute_value.pl --gff mrna.gff  --value 0.3 -a _AED -t ">=" -o filter/codingGeneFeatures.filter.gff

3. Longest isoform per mRNA

Now we filter mRNAs to keep only the longest isoform when several are avaialble.

agat_sp_keep_longest_isoform.pl --gff filter/codingGeneFeatures.filter.gff -o filter/codingGeneFeatures.filter.longest_cds.gff

4. Complete genes

We then check that the gene models are complete (has a start and stop codon), and remove the incomplete ones.

agat_sp_filter_incomplete_gene_coding_models.pl --gff filter/codingGeneFeatures.filter.longest_cds.gff -f genome.fa -o filter/codingGeneFeatures.filter.longest_cds.complete.gff

5. Distance from neighboring genes

We can also filter the gene models by distance from neighboring genes in order to be sure to have nice intergenic region wihtout genes in it in order to train properly intergenic parameters. (default 500 bp)

agat_sp_filter_by_locus_distance.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.gff -o filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff

6. Remove redundancy

To avoid to bias the training and give an exhaustive view of the diversity of gene we have to remove the ones that are too similar to each other. In order to do so, we translate our coding genes into proteins, format the protein fasta file to be able to run a recursive blast and then filter them.

agat_sp_extract_sequences.pl --gff filtercodingGeneFeatures.filter.longest_cds.complete.good_distance.gff -f genome.fa --aa -o protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa
makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -dbtype prot
blastp -query protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -db 
protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive
agat_sp_filter_by_mrnaBlastValue.pl --gff filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --outfile nonredundant/codingGeneFeatures.nr.gff

Voila, nonredundant/codingGeneFeatures.nr.gff contains your nice gene set.

You need several hundred genes to train properly an ab initio tool. The more you have, the better is... but you kind of reach a plateau around 700 genes.

geneset genome gene abinitio • 7.8k views
ADD COMMENT
1
Entering edit mode

Hi Jacques, Thank you for sharing your protocol. I just wanted to check with you step 5. Should you not request agat_sp_extract_sequences.pl to translate using the --aa switch? Thanks in advance

ADD REPLY
0
Entering edit mode

Yes you're right, thank you for pointing it.

ADD REPLY
0
Entering edit mode

Hi, I have the below GFF3 file and I would like to remove mRNAs which do not have a description i.e. do not contain Note. Does AGAT remove those mRNAs? If not then do you know any other tool?

NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .       ID=g2
NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03    +       .       ID=g2.t1;Note=B3 domain-containing protein Os03g0120900;Parent=g2
NbV1Ch08        AUGUSTUS        transcription_start_site        60876   60876   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072   0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    60876   61072   .       +       .       ID=g2.t1.exon1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  61673   61732   0.37    +       .       ID=g2.t1.5UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    61673   63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        start_codon     61733   61735   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54    +       0       ID=g2.t1.CDS1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        stop_codon      62972   62974   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449   1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 63565   63944   0.27    +       .       ID=g2.t1.3UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    63565   63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1
NbV1Ch08        AUGUSTUS        transcription_end_site  63944   63944   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        gene    64722   65524   0.32    -       .       ID=g3
NbV1Ch08        AUGUSTUS        mRNA    64722   65524   0.32    -       .       ID=g3.t1;Parent=g3
NbV1Ch08        AUGUSTUS        transcription_end_site  64722   64722   .       -       .       Parent=g3.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 64722   64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        exon    64722   65524   .       -       .       ID=g3.t1.exon1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        stop_codon      64793   64795   .       -       0       Parent=g3.t1
NbV1Ch08        AUGUSTUS        CDS     64793   65494   0.44    -       0       ID=g3.t1.CDS1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        start_codon     65492   65494   .       -       0       Parent=g3.t1

Thank you in advance.

ADD REPLY
0
Entering edit mode

I think I have not develop any script to perform such filtering within the AGAT toolkit, because a simple awk command can do the trick:

awk '{if(!($3=="mRNA"  && $9 ~ "Note"))print $0}' file.gff > filtered_file.gff
ADD REPLY
0
Entering edit mode

Thank you. However, using the above command will ignore the below sub features:

transcription_start_site
five_prime_utr
exon
five_prime_utr
exon
start_codon
CDS
stop_codon
three_prime_utr
three_prime_utr
exon
transcription_end_site
ADD REPLY
0
Entering edit mode

What do you mean? Those features should still be in the output.

ADD REPLY
0
Entering edit mode

I would like to keep those features because they mRNA contain Note (writing into keep.gff3 file)

NbV1Ch08        AUGUSTUS        gene    60876   63944   0.03    +       .       ID=g2
NbV1Ch08        AUGUSTUS        mRNA    60876   63944   0.03    +       .       ID=g2.t1;Note=B3 domain-containing protein Os03g0120900;Parent=g2
NbV1Ch08        AUGUSTUS        transcription_start_site        60876   60876   .       +       .       Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  60876   61072   0.19    +       .       ID=g2.t1.5UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    60876   61072   .       +       .       ID=g2.t1.exon1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        five_prime_utr  61673   61732   0.37    +       .       ID=g2.t1.5UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    61673   63449   .       +       .       ID=g2.t1.exon2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        start_codon     61733   61735   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        CDS     61733   62974   0.54    +       0       ID=g2.t1.CDS1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        stop_codon      62972   62974   .       +       0       Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 62975   63449   1       +       .       ID=g2.t1.3UTR1;Parent=g2.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 63565   63944   0.27    +       .       ID=g2.t1.3UTR2;Parent=g2.t1
NbV1Ch08        AUGUSTUS        exon    63565   63944   .       +       .       ID=g2.t1.exon3;Parent=g2.t1
NbV1Ch08        AUGUSTUS        transcription_end_site  63944   63944   .       +       .       Parent=g2.t1

On the other hand, I would like to reject those features because mRNA does not contain Note (writing into reject.gff3 file)

NbV1Ch08        AUGUSTUS        gene    64722   65524   0.32    -       .       ID=g3
NbV1Ch08        AUGUSTUS        mRNA    64722   65524   0.32    -       .       ID=g3.t1;Parent=g3
NbV1Ch08        AUGUSTUS        transcription_end_site  64722   64722   .       -       .       Parent=g3.t1
NbV1Ch08        AUGUSTUS        three_prime_utr 64722   64792   0.77    -       .       ID=g3.t1.3UTR1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        exon    64722   65524   .       -       .       ID=g3.t1.exon1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        stop_codon      64793   64795   .       -       0       Parent=g3.t1
NbV1Ch08        AUGUSTUS        CDS     64793   65494   0.44    -       0       ID=g3.t1.CDS1;Parent=g3.t1
NbV1Ch08        AUGUSTUS        start_codon     65492   65494   .       -       0       Parent=g3.t1
ADD REPLY
0
Entering edit mode

Ok I understand, then awk is not well suited. Unfortunalty I do not have script yet to perform this task.

ADD REPLY
0
Entering edit mode

It seems to work for now after sed 's|,AT.*-Protein;||g' TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein.gff > TAIR10_GFF3_genes-fix1.rm_rubbish_rm_protein_rm_id.gff.

ADD REPLY
0
Entering edit mode

It seems you mixed up your answer with that post isn't it?

ADD REPLY
0
Entering edit mode

Thank you for noticing.

ADD REPLY
0
Entering edit mode

I have now a script to perform this task properly in AGAT called agat_sp_filter_feature_by_attribute_presence.pl

ADD REPLY
0
Entering edit mode

Hi Thanks for your guide !

I am working to get a nice set of gene after my 1st round of maker. In that case, what is annotation_evidence_based.gff ?

ADD REPLY
2
Entering edit mode

It's your output of maker after using:

...../src/maker/bin/gff3_merge -d your_name_master_datastore_index.log

He called it evidence_based because your first maker run should be using protein and RNA (ESTs) evidence to detect genes. The next maker round you do will be without these evidences (you should disable them) and only with the model .hmm file that you create with SNAP or Augustus.

Cheers,

ADD REPLY
0
Entering edit mode

I didn't even realize this was necessary. The maker tutorials don't mention it at all, they just proceed to SNAP modelling with all genes, using only "fathom" as quality control. Does this make a very big difference in results? Where can I learn about it?

Cheers,
P.S: Thank you so much for sharing your knowledge so well!

ADD REPLY
1
Entering edit mode

By experience the approach I mention here give quite similar results (at least for Augustus), excepted it needs only one round of MAKER, that make the approach much more efficient in term of time.

I have seen you message on the MAKER mailing list, you can use alt_ESTs only but not with est2genome=1 because no gene prediction are made with alt_ESTs (it is not obvious but written somewhere in one of their paper). What you can do is to use alt_EST data as EST too. I often do that, it might help if it is not too diverged.But you will not have many prediction ET-based like that.

ADD REPLY
0
Entering edit mode

Hi Juke34,

Do you have a guide about how do you train nonredundant/codingGeneFeatures.nr.gff with a gene predictor like AUGUSTUS ? thanks

ADD REPLY
0
Entering edit mode

Look at the pipeline I mention below you can find the commands needed to train Augustus from the nr file

ADD REPLY
0
Entering edit mode

Thanks Juke, I was wondering if this command (from your tutorial):

new_species.pl --species=$species_label
etraining --species=$species_label $training_file
augustus --species=$species_label $test_file | tee ${species_label}_run.log

does the same thing than :

autoAug.pl --genome=genome.fa --species=species_name --trainingset=codingGeneFeatures.nr.gff --singleCPU -v 

did you have the opportunity to test ?

ADD REPLY
0
Entering edit mode

I think the training set must be in Genbank format, you have to convert it first

ADD REPLY
0
Entering edit mode

Hi do you have a command that filter out genes < x AA ? because even after all these steps I have some genes (very minor) that are very short and I think these are false positives.

ADD REPLY
0
Entering edit mode

agat_sp_filter_gene_by_length.pl from AGAT

ADD REPLY
0
Entering edit mode

Hi Juke,

I am wondering whether it is possible to output codingGeneFeatures.nr.gff (from step 5) to be .fasta so that one can use it in busco or Maker or Braker?

I am aware that you will update the post here (I wrote to you on Github already about that).

I will appreciate your feedback. Thank you

ADD REPLY
0
Entering edit mode

Hi, Using agat_sp_extract_sequences.pl (Extracting genomic feature sequences from GTF/GFF files with AGAT), based on the annotation gff+genome fasta you can extract the protein in fasta format.

ADD REPLY
0
Entering edit mode

Hi Juke,

Given that training Augustus in Busco for example requires a fasta file as input (see command below), I am wondering, how do you proceed from nonredundant/codingGeneFeatures.nr.gff (results from step) to train these gene sets in Busco?

Augustus/Busco command:

${BUSCO_PATH}/busco -i ${GENOME} -o ${Prefix} -l ${BUSCODB} -m genome -c ${ThreadN} -f --long --augustus_parameters='--progress=true' 
ADD REPLY
0
Entering edit mode

I don't get what you want to achieve, could you be more explicit? What version of Busco are you using?

ADD REPLY
0
Entering edit mode
4.7 years ago
Juke34 8.9k

For whom interested we also have a nextflow pipeline for it that goes even beyond ( does the training too ).
You can find it here: https://github.com/NBISweden/pipelines-nextflow/tree/master/AbinitioTraining

ADD COMMENT
0
Entering edit mode

Thanks for your guide.

Can you explain what are those parameters ?

params.test_size = 100
params.flank_region_size = 1000
ADD REPLY
0
Entering edit mode

params.test_size = 100

It is the number of model/gene used for the test, the rest being used for the training

params.flank_region_size = 1000

it is used during the conversion from GFF to genbank by gff2gbSmallDNA.pl. It is the max-size-of-gene-flanking-DNA parameter. It is used to have intergenic regions in your dataset. It must be =< of the locus_distance used by agat_sp_filter_by_locus_distance.pl to be sure to not have any coding part in the intergenic region.

ADD REPLY
0
Entering edit mode

Then I think there is a mistake in the pipeline because you set

params.locus_distance = 3000

and then after

params.flank_region_size = 1000
ADD REPLY
1
Entering edit mode

Sorry I meant flank_region_size < locus_distance

ADD REPLY

Login before adding your answer.

Traffic: 1609 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6