4.5 years ago
jamie.pike
I recently ran ab initio gene prediction with Augustus and now want to filter the output (see below) by the length of the predicted amino acid sequence, i.e. predictions shorter than 30 aa should be excluded from further work. I intended to filter with awk, but I cannot find anything in any of the fields that gives the size of the predicted protein. Does anyone have suggestions for filtering this GTF by protein length?
The columns (fields) contain:
seqname source feature start end score strand frame transcript and gene name
# ----- prediction on sequence number 1 (length = 5210, name = AGND01000115.1:654099-659309(+)) -----
#
# Predicted genes for sequence number 1 on both strands
# start gene g1
AGND01000115.1:654099-659309(+) AUGUSTUS gene 979 2277 0.76 - . g1
AGND01000115.1:654099-659309(+) AUGUSTUS transcript 979 2277 0.76 - . g1.t1
AGND01000115.1:654099-659309(+) AUGUSTUS stop_codon 979 981 . - 0 transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS CDS 979 1071 0.99 - 0 transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS CDS 1120 1859 0.78 - 2 transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS CDS 1905 2277 0.98 - 0 transcript_id "g1.t1"; gene_id "g1";
AGND01000115.1:654099-659309(+) AUGUSTUS start_codon 2275 2277 . - 0 transcript_id "g1.t1"; gene_id "g1";
# protein sequence = [MPRAHDHFHGRHYHAERATGPVKSLNPTKRYLIADRKPLHAESDAGKESRPSAESPGVAYVWRSRDNRKGRHALVISV
# DPRKHDATKAPRPSNSYHQTLRGILKMFVRYPVWDVSYDVAIVFTIGSIIWVINGFFSWLPVLNPSTKFSDWAGGLTAFIGATVFEFGSILLMLEAVN
# ENRADCFGWAVEESIDGMLHLTHADNCKHAHAHKGTFVKQSSKTLDNNTTESAGNDRMWSWWPTWYELRSHYFFDIGFLACSSQTFGATVFWISGFTA
# LPPILNNLSTPAENGVYWLPQVIGGTGFIVSSTLFMVEVQPRWYIPAPGVLGWHIGLWNLIGAIGFTLCGALGFGITHPGVEYALTLSTFIGSWAFLI
# GSVIQWYESLNKYPIWVDQKIERLGKRKS]
# end gene g1
###
Why not extract all the protein sequences into a separate FASTA file and filter that file by length (using seqkit or similar)?
(It's not straightforward to get the length directly from the fields of the Augustus output.)
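If you would rather stay with the file you already have, here is a rough Python sketch of the same idea. It pulls each protein out of the "# protein sequence = [...]" comment block and pairs it with the most recent transcript line. The function names and the 30 aa default are my own, and it assumes the output looks like the sample in the question (one protein block after each transcript, transcript ID in the ninth column of the transcript row), so check it against your own file first.

```python
def augustus_proteins(lines):
    """Yield (transcript_id, protein) pairs from Augustus output lines.

    Assumes each '# protein sequence = [...]' block follows the
    'transcript' row it belongs to, as in the sample in the question.
    """
    transcript_id = None
    buffer = None  # None while we are outside a protein comment block
    for line in lines:
        line = line.rstrip("\n")
        if not line.startswith("#"):
            fields = line.split()
            # On 'transcript' rows the ninth column is just the ID (e.g. g1.t1).
            if len(fields) >= 9 and fields[2] == "transcript":
                transcript_id = fields[8]
            continue
        text = line.lstrip("#").strip()
        if text.startswith("protein sequence = ["):
            buffer = text.split("[", 1)[1]
        elif buffer is not None:
            buffer += text  # continuation line of a wrapped sequence
        if buffer is not None and buffer.endswith("]"):
            yield transcript_id, buffer[:-1]
            buffer = None


def transcripts_passing(lines, min_aa=30):
    """Return the set of transcript IDs whose protein has at least min_aa residues."""
    return {tid for tid, prot in augustus_proteins(lines) if len(prot) >= min_aa}
```

You could call transcripts_passing(open("augustus.gtf")) (the file name is a placeholder) and then keep only the GTF rows whose transcript ID is in the returned set, or write the pairs out as FASTA and filter them with seqkit as suggested above.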
You can filter the GFF by ORF length using agat_sp_filter_by_ORF_size.pl from AGAT.
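To see what ORF length means in terms of the GTF fields themselves, here is a small Python sketch that sums the CDS rows per transcript and converts the total to amino acids. This is not how AGAT does it, just an illustration. The function name is hypothetical, it expects tab-separated rows, and the stop_in_cds flag is my assumption based on the sample in the question, where the stop_codon row (979-981) lies inside the first CDS row (979-1071).

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def cds_protein_lengths(lines, stop_in_cds=True):
    """Return {transcript_id: protein length in aa} computed from CDS rows.

    stop_in_cds=True subtracts one codon, on the assumption that the CDS
    coordinates include the stop codon (as they appear to in the sample).
    """
    nucleotides = defaultdict(int)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # GTF coordinates are 1-based and inclusive.
            nucleotides[match.group(1)] += int(fields[4]) - int(fields[3]) + 1
    return {
        tid: total // 3 - (1 if stop_in_cds else 0)
        for tid, total in nucleotides.items()
    }
```

For g1.t1 in the question the three CDS rows add up to 93 + 740 + 373 = 1206 nt, i.e. 402 codons including the stop, so 401 aa. Transcripts below your cutoff can then be dropped by ID.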