I have a GFF3 file that contains complete (full-length) sequences, but a few of those sequences have multiple UTRs. I would like to filter them out. Is there a utility available for this?
scaffold105size588288 transdecoder gene 130390 132407 . + .
scaffold105size588288 transdecoder mRNA 130390 132407 . + .
scaffold105size588288 transdecoder five_prime_UTR 130390 130818 . + .
scaffold105size588288 transdecoder exon 130390 132407 . + .
scaffold105size588288 transdecoder CDS 130819 131979 . + 0
scaffold105size588288 transdecoder three_prime_UTR 131980 132407 . + .
scaffold105size588288 transdecoder gene 278652 281390 . + .
scaffold105size588288 transdecoder mRNA 278652 281390 . + .
scaffold105size588288 transdecoder five_prime_UTR 278652 278776 . + .
scaffold105size588288 transdecoder exon 278652 278847 . + .
scaffold105size588288 transdecoder CDS 278777 278847 . + 0
scaffold105size588288 transdecoder exon 279283 280020 . + .
scaffold105size588288 transdecoder CDS 279283 279589 . + 1
scaffold105size588288 transdecoder exon 280311 280393 . + .
scaffold105size588288 transdecoder three_prime_UTR 280311 280393 . + .
scaffold105size588288 transdecoder three_prime_UTR 280593 280678 . + .
scaffold105size588288 transdecoder three_prime_UTR 280757 280812 . + .
In this trimmed example, I need to remove the second gene set, as it has three 3' UTRs, and retain the first one, which is a more complete set.
Thanks in advance.
Select the lines whose column 3 (the feature type) is "gene". Please use Google to find a solution for this using awk or sed.
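Selecting the "gene" lines alone does not tell you how many UTRs each gene has, so here is a rough Python sketch of one way to do the actual filtering. It assumes that every gene block starts with a "gene" line and that all of its child features follow it directly, as in the example above; it does not follow ID/Parent attributes. The function name `filter_single_utr_genes` and the `max_per_type` parameter are made up for this illustration. Note that comment and blank lines are dropped from the output.

```python
from collections import Counter

# Feature types (column 3) that are counted per gene block.
UTR_TYPES = ("five_prime_UTR", "three_prime_UTR")


def _block_ok(counts, max_per_type):
    """True if no UTR type occurs more than max_per_type times."""
    return all(counts[t] <= max_per_type for t in UTR_TYPES)


def filter_single_utr_genes(lines, max_per_type=1):
    """Yield the lines of every gene block that has at most
    max_per_type five_prime_UTR and three_prime_UTR features.

    Assumption: a block starts at a 'gene' line and runs until
    the next 'gene' line (or the end of the input).
    """
    block = []
    counts = Counter()
    for raw in lines:
        line = raw.rstrip("\n")
        fields = line.split()
        # Skip blank lines, comments/directives, and malformed lines.
        if len(fields) < 3 or line.startswith("#"):
            continue
        ftype = fields[2]
        if ftype == "gene":
            # A new gene starts: emit the previous block if it passed.
            if block and _block_ok(counts, max_per_type):
                yield from block
            block = []
            counts = Counter()
        block.append(line)
        counts[ftype] += 1
    # Emit the final block, if any.
    if block and _block_ok(counts, max_per_type):
        yield from block
```

To use it, something like `for line in filter_single_utr_genes(open("input.gff3")): print(line)` should work, where `input.gff3` is a placeholder for your own file. On the trimmed example above it should keep only the first gene, since the second one has three `three_prime_UTR` features.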