GRCh37 GFF filter transcript isoforms by RefSeq Select tag or longest
1
0
Entering edit mode
3.3 years ago
Carambakaracho ★ 3.3k

Dear all,

I want to filter the "RefSeq Select" transcript isoforms in the GRCh37.p13 human genome annotation gff (GCF_000001405.25_GRCh37.p13_genomic.gff.gz). Specifically my goal is to retain for each gene a transcript isoform with a tag=RefSeq Select attribute if exists (as I understood I can consider them canonical), and in case none is present, the longest isoform.

My idea was to use @Juke34's AGAT tools. I planned to

  1. filter the longest isoforms
  2. filter a second gff for the tag using agat_sp_filter_feature_by_attribute_presence.pl ->docu,
  3. create an ID list of the genes contained in that second gff
  4. remove those IDs from the longest isoform gff
  5. merge the resulting gff with the filtered

On the one hand, the number of steps in this strategy makes it a bit delicate, on the other hand, it falls flat at step 2 anyway:

# get gff (~20mb)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz
# build toy set
zgrep "gene=KRAS;" GCF_000001405.25_GRCh37.p13_genomic.gff.gz \
    >kras_excerpt.gff
agat_sp_filter_feature_by_attribute_presence.pl \
    --gff kras_excerpt.gff \
    --attribute tag \
    --flip \
    --output test_flip
# empty result
cat test_flip
##gff-version 3

It's quite likely my googlefoo failed me this time: Does anybody have some recommendation how to approach this? Or is the only solution writing yet another gff processor tool?

GFF • 2.3k views
ADD COMMENT
0
Entering edit mode

Juke34 identified the original problem to the second step, I didn't add --type level2, so it correctly filtered out all level one features not containing the tag, like genes, etc. Lovely gff world

ADD REPLY
0
Entering edit mode
3.3 years ago
Juke34 8.9k

On your test you keep the feature that do not have the attribute ‘tag’. Either they all have the attribute ‘tag’. My guess is that your Gff file is empty e.i. Does not contain any feature… could you give a check?

ADD COMMENT
0
Entering edit mode

Thanks, I checked that for a handful genes, in particular the excerpt gff. In those examples one transcript isoform has the tag. I can attach the excerpt only later today.

I verified the output without the --flip option and this removed the isoform with the tag, as I understood/expected from the documentation.

ADD REPLY
0
Entering edit mode

Hi Juke34 and others,

[EDIT AFTER THE ANSWERS BELOW] As the details of this discussion are not exactly related to the original question, I moved the content of this post to the AGAT issue tracker

ADD REPLY
0
Entering edit mode

Ok then it might be a bug, or the parser modify the attributes at some point before the treatment. To check the second point could you run the gxf2gxf script that only run the parser and check the output to see if the attribute is appearing at an other place?

ADD REPLY
0
Entering edit mode

Sure - I would take it to the agat issue tracker, then, and maybe even remove a bit from the clutter here. That is, unless you disagree of course.

ADD REPLY
0
Entering edit mode

Sounds good

ADD REPLY

Login before adding your answer.

Traffic: 2579 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6