Dear all,
I want to filter the "RefSeq Select" transcript isoforms in the GRCh37.p13 human genome annotation gff (GCF_000001405.25_GRCh37.p13_genomic.gff.gz). Specifically my goal is to retain for each gene a transcript isoform with a tag=RefSeq Select
attribute if exists (as I understood I can consider them canonical), and in case none is present, the longest isoform.
My idea was to use @Juke34's AGAT tools. I planned to
- filter the longest isoforms
- filter a second gff for the tag using
agat_sp_filter_feature_by_attribute_presence.pl
->docu, - create an ID list of the genes contained in that second gff
- remove those IDs from the longest isoform gff
- merge the resulting gff with the filtered
On the one hand, the number of steps in this strategy makes it a bit delicate, on the other hand, it falls flat at step 2 anyway:
# get gff (~20mb)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
# build toy set
zgrep "gene=KRAS;" GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
>kras_excerpt.gff
agat_sp_filter_feature_by_attribute_presence.pl \
--gff kras_excerpt.gff \
--attribute tag \
--flip \
--output test_flip
# empty result
cat test_flip
##gff-version 3
It's quite likely my googlefoo failed me this time: Does anybody have some recommendation how to approach this? Or is the only solution writing yet another gff processor tool?
Juke34 identified the original problem to the second step, I didn't add
--type level2
, so it correctly filtered out all level one features not containing the tag, like genes, etc. Lovely gff world