remove UTR annotations from gff
2
0
Entering edit mode
5.5 years ago
rob234king ▴ 610

I have UTR annotations which I need to remove for a set of genes, which I could do just by deleting but then it would not update the start and stop positions of the gene and mRNA features, and correct exon features. Is there a tool to remove UTR features from a gff file and make the necessary corrections to the remaining mRNA and gene, and exon features? I can't seem to find something but there must be??

gff • 2.8k views
ADD COMMENT
2
Entering edit mode
5.5 years ago
AK ★ 2.2k

Hi rob234king,

I think you can try gff3_file_UTR_trimmer.pl from PASApipeline:

(edited: changed the example to the gff3 in PASA directory; still thinking the meaning of a shift of phase, not a copy-paste problem)

$ cat test.gff3
gi|68711    TIGR    gene    12923   14228   .   +   .   ID=68711.t00017;Name=my_protein_name
gi|68711    TIGR    mRNA    12923   14228   .   +   .   ID=model.68711.m00017;Parent=68711.t00017
gi|68711    TIGR    five_prime_utr  12923   13029   .   +   .   ID=utr5p_of_68711.m00017;Parent=68711.m00017
gi|68711    TIGR    exon    12923   13060   .   +   .   ID=68711.e00069;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13030   13060   .   +   0   ID=13030_13060cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13411   13550   .   +   .   ID=68711.e00070;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13411   13550   .   +   1   ID=13411_13550cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13677   13802   .   +   .   ID=68711.e00071;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13677   13802   .   +   0   ID=13677_13802cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13876   14228   .   +   .   ID=68711.e00072;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13876   14016   .   +   0   ID=13876_14016cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    three_prime_utr 14017   14228   .   +   .   ID=utr3p_of_68711.m00017;Parent=68711.m00017

$ perl ~/src/PASApipeline-v2.3.3/misc_utilities/gff3_file_UTR_trimmer.pl test.gff3
gi|68711    TIGR    gene    13030   14016   .   +   .   ID=68711.t00017.1;Name=my_protein_name
gi|68711    TIGR    mRNA    13030   14016   .   +   .   ID=model.68711.m00017;Parent=68711.t00017.1;Name=my_protein_name
gi|68711    TIGR    exon    13030   13060   .   +   .   ID=model.68711.m00017.exon1;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13030   13060   .   +   0   ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13411   13550   .   +   .   ID=model.68711.m00017.exon2;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13411   13550   .   +   2   ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13677   13802   .   +   .   ID=model.68711.m00017.exon3;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13677   13802   .   +   0   ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711    TIGR    exon    13876   14016   .   +   .   ID=model.68711.m00017.exon4;Parent=model.68711.m00017
gi|68711    TIGR    CDS 13876   14016   .   +   0   ID=cds.model.68711.m00017;Parent=model.68711.m00017
ADD COMMENT
0
Entering edit mode

great thanks, I'll give it a go

ADD REPLY
0
Entering edit mode

It worked perfectly, once I had the right perl version running for it.

ADD REPLY
0
Entering edit mode

Weird that the phase of the second CDS have changed from 1 to 2 during the process. A bug or copy past problem?

ADD REPLY
1
Entering edit mode
5.5 years ago
Ace ▴ 90

It seems simplest to just use grep for this.

cat $gff | grep -v "five_prime_UTR" | grep -v "three_prime_UTR" > $new_gff

You could probably even just use

cat $gff | grep -v "_UTR"

and testing is as easy as

cat $gff | grep -v "_UTR" | cut -f 3 | grep -v "#" | sort -u

cat $gff | grep "_UTR" | cut -f 3 | grep -v "#" | sort -u

wherein the first diagnostic should contain know UTR categories and the last should be only UTR categories

ADD COMMENT
0
Entering edit mode

But as rob234king said:

I could do just by deleting but then it would not update the start and stop positions of the gene and mRNA features, and correct exon features.

ADD REPLY
0
Entering edit mode

Oh good catch, I must have glossed over that initially. I've written if statements into for loops to do this previously. Logic follows like this:

for gene in list

if strand=+

if UTR_start < gene start

new_Gene_start=UTR_Start

Repeat for end, set - strand conditions, etc etc

but it's a bit bulky and I've never tried to incorporate exons this way. Better to not reinvent the wheel if there's a functional perl script.

ADD REPLY

Login before adding your answer.

Traffic: 1984 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6