remove UTR annotations from gff
2
I have UTR annotations which I need to remove for a set of genes, which I could do just by deleting but then it would not update the start and stop positions of the gene and mRNA features, and correct exon features. Is there a tool to remove UTR features from a gff file and make the necessary corrections to the remaining mRNA and gene, and exon features? I can't seem to find something but there must be??
gff
• 2.8k views
•
link
updated 5.5 years ago by
Ace
▴
90
•
written 5.5 years ago by
rob234king
▴
610
Hi rob234king ,
I think you can try gff3_file_UTR_trimmer.pl from PASApipeline:
(edited: changed the example to the gff3 in PASA directory; still thinking the meaning of a shift of phase, not a copy-paste problem)
$ cat test.gff3
gi|68711 TIGR gene 12923 14228 . + . ID=68711.t00017;Name=my_protein_name
gi|68711 TIGR mRNA 12923 14228 . + . ID=model.68711.m00017;Parent=68711.t00017
gi|68711 TIGR five_prime_utr 12923 13029 . + . ID=utr5p_of_68711.m00017;Parent=68711.m00017
gi|68711 TIGR exon 12923 13060 . + . ID=68711.e00069;Parent=model.68711.m00017
gi|68711 TIGR CDS 13030 13060 . + 0 ID=13030_13060cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR exon 13411 13550 . + . ID=68711.e00070;Parent=model.68711.m00017
gi|68711 TIGR CDS 13411 13550 . + 1 ID=13411_13550cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR exon 13677 13802 . + . ID=68711.e00071;Parent=model.68711.m00017
gi|68711 TIGR CDS 13677 13802 . + 0 ID=13677_13802cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR exon 13876 14228 . + . ID=68711.e00072;Parent=model.68711.m00017
gi|68711 TIGR CDS 13876 14016 . + 0 ID=13876_14016cds_of_68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR three_prime_utr 14017 14228 . + . ID=utr3p_of_68711.m00017;Parent=68711.m00017
$ perl ~/src/PASApipeline-v2.3.3/misc_utilities/gff3_file_UTR_trimmer.pl test.gff3
gi|68711 TIGR gene 13030 14016 . + . ID=68711.t00017.1;Name=my_protein_name
gi|68711 TIGR mRNA 13030 14016 . + . ID=model.68711.m00017;Parent=68711.t00017.1;Name=my_protein_name
gi|68711 TIGR exon 13030 13060 . + . ID=model.68711.m00017.exon1;Parent=model.68711.m00017
gi|68711 TIGR CDS 13030 13060 . + 0 ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR exon 13411 13550 . + . ID=model.68711.m00017.exon2;Parent=model.68711.m00017
gi|68711 TIGR CDS 13411 13550 . + 2 ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR exon 13677 13802 . + . ID=model.68711.m00017.exon3;Parent=model.68711.m00017
gi|68711 TIGR CDS 13677 13802 . + 0 ID=cds.model.68711.m00017;Parent=model.68711.m00017
gi|68711 TIGR exon 13876 14016 . + . ID=model.68711.m00017.exon4;Parent=model.68711.m00017
gi|68711 TIGR CDS 13876 14016 . + 0 ID=cds.model.68711.m00017;Parent=model.68711.m00017
•
link
5.5 years ago by
AK
★
2.2k
It seems simplest to just use grep for this.
cat $gff | grep -v "five_prime_UTR" | grep -v "three_prime_UTR" > $new_gff
You could probably even just use
cat $gff | grep -v "_UTR"
and testing is as easy as
cat $gff | grep -v "_UTR" | cut -f 3 | grep -v "#" | sort -u
cat $gff | grep "_UTR" | cut -f 3 | grep -v "#" | sort -u
wherein the first diagnostic should contain know UTR categories and the last should be only UTR categories
•
link
5.5 years ago by
Ace
▴
90
Login before adding your answer.
Traffic: 1984 users visited in the last hour
great thanks, I'll give it a go
It worked perfectly, once I had the right perl version running for it.
Weird that the phase of the second CDS have changed from 1 to 2 during the process. A bug or copy past problem?