Is there a tool that sorts gtf files?
6.7 years ago
JJ ▴ 710

Hi all,

I am looking for a tool that sorts gtf annotation files. Can anyone recommend one? I only came across tools that sort gff files like bedtools or gt.

I am grateful for any suggestions!

Thanks JJ

RNA-Seq genome • 25k views

Hello,

By what criteria do you want to sort, and why? You can sort by columns with the standard Unix command sort.
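
A minimal sketch of such a column sort, assuming a tab-separated GTF without header lines, with the sequence name in column 1, start in column 4 and end in column 5. The function name `sort_gtf_basic` is made up for illustration:

```shell
# Sort by sequence name, then numerically by start and end.
# LC_ALL=C makes the column-1 comparison byte-wise and reproducible across locales.
sort_gtf_basic() {
    tab=$(printf '\t')
    LC_ALL=C sort -t "$tab" -k1,1 -k4,4n -k5,5n "$1"
}

# Usage: sort_gtf_basic in.gtf > out_sorted.gtf
```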

fin swimmer


Actually, I want to add some annotations to a standard annotation gtf file and then use the standard sorting to put the newly added annotations at their "proper" place.

I was thinking of StringTie --merge as an alternative, but as the annotation file and the new annotations are non-redundant, I figured a simple sort should also do the trick.
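
A minimal sketch of that "append, then sort" idea, assuming both files are tab-separated GTF, that comment lines start with `#`, and that ordering by sequence name, start and end is what "proper place" means here. The function name `merge_and_sort_gtf` and the file names are hypothetical:

```shell
# Keep the comment header of the base annotation, then sort the records of both files together.
merge_and_sort_gtf() {
    base=$1
    extra=$2
    tab=$(printf '\t')
    # Header lines (starting with '#') from the base file go first, unsorted.
    grep '^#' "$base" || true
    # All remaining records from both files are sorted by seqname, start, end.
    cat "$base" "$extra" | grep -v '^#' | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n -k5,5n
}

# Usage: merge_and_sort_gtf annotation.gtf new_features.gtf > merged.sorted.gtf
```

Records with identical sequence name, start and end are ordered by a plain byte-wise comparison of the whole line, so this does not enforce any parent/child order.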


I want to add some annotations to a standard annotation gtf file and then use the standard sorting to put the newly added annotations at their "proper" place

I had to do the same exact thing not long ago, here is the full recipe just in case it might help ;-)

6.7 years ago
erwan.scaon ▴ 950

I recommend sorting with the tool "gff3sort", given that with the standard Unix sort, lines with the same chromosome and start position end up in an arbitrary order.

gff3sort avoids this pitfall.
For example:

# Sort your gtf/gff & bgzip it
gff3sort.pl --precise --chr_order natural file.gtf/gff | bgzip > file.gtf/gff.gz;

# Create associated index
tabix -p gff file.gtf/gff.gz;
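
To see how plain `sort` treats such ties, here is a small sketch on two made-up records that share sequence name and start. With GNU or BSD `sort`, lines whose keys compare equal fall back to a byte-wise comparison of the whole line, so the result ignores parent/child relationships (`exon` lands before `gene`); adding `-s` (stable) keeps the input order instead:

```shell
tab=$(printf '\t')
# Two made-up records with identical seqname and start; the parent feature comes first.
records=$(printf 'chr1\tsrc\tgene\t100\t900\nchr1\tsrc\texon\t100\t300\n')

# Default: ties are broken by comparing whole lines, so "exon" sorts before "gene".
default_order=$(printf '%s\n' "$records" | LC_ALL=C sort -t "$tab" -k1,1 -k4,4n | cut -f3 | tr '\n' ' ')

# Stable (-s): ties keep their input order, so "gene" stays ahead of "exon".
stable_order=$(printf '%s\n' "$records" | LC_ALL=C sort -s -t "$tab" -k1,1 -k4,4n | cut -f3 | tr '\n' ' ')

echo "default: $default_order"
echo "stable:  $stable_order"
```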

I have yet to encounter the first useful example of using gff3sort over the normal Linux sort.

I also find it surprising that this gets published while others are struggling to get really interesting biological stuff published ....


Hello erwan,

I recommend sorting with the tool "gff3sort", given that with the standard Unix sort, lines with the same chromosome and start position end up in an arbitrary order.

gff3sort avoids this pitfall.

Could you please explain why this should be a pitfall? If there are more sorting criteria, I have to define them in some way.

fin swimmer


I think that what they mean is that a GFF file may need to be sorted by a column where the values are not ordered lexicographically or numerically. For example: mRNA needs to precede exon, and CDS may need to come after exon.

That being said, a gff3sort should be a tool that creates the extra columns, translating the values into sortable ones; the user should then use sort directly. It is unlikely that a gff3sort written in Perl would be able to compete in performance and features with a standard Unix sort.
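
The "extra column" idea could be sketched as a decorate-sort-undecorate pipeline: awk prepends a numeric rank derived from the feature type (column 3), `sort` uses it as a tie-breaker after sequence name and start, and `cut` removes it again. The rank table (gene, then transcript/mRNA, then exon, then CDS, everything else last) and the function name `sort_by_feature_rank` are assumptions for illustration, not what gff3sort does internally:

```shell
sort_by_feature_rank() {
    tab=$(printf '\t')
    # 1) decorate: awk prepends a numeric rank for the feature type (column 3)
    # 2) sort: after decoration, seqname is column 2, start is column 5, rank is column 1
    # 3) undecorate: cut removes the rank column again
    awk -F'\t' -v OFS='\t' '
        BEGIN {
            # Hypothetical rank table; a lower rank sorts first among equal starts.
            rank["gene"] = 1; rank["transcript"] = 2; rank["mRNA"] = 2
            rank["exon"] = 3; rank["CDS"] = 4
        }
        /^#/ { next }   # comment lines are dropped in this sketch
        {
            r = ($3 in rank) ? rank[$3] : 9   # unknown feature types go last
            print r, $0
        }
    ' "$1" | LC_ALL=C sort -t "$tab" -k2,2 -k5,5n -k1,1n | cut -f2-
}

# Usage: sort_by_feature_rank in.gff > out_sorted.gff
```

Because the rank comes from the feature column alone, the same sketch applies to GTF files that carry no "Parent=" attribute.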


With a bit of reasoning (and knowledge of the GFF format), you can easily achieve all of this with Linux sort ;-)

and +1 for the performance comment!


Does this tool also accept GTF? I can only see GFF stated as input.


I assumed it would accept GTF when posting (after all GTF is "GFF2.5", which is really close to GFF3), but since you asked I did a quick check :

Not knowing what your GTF looks like, I took a random example: I ran the gff3sort tool on both the GTF and GFF3 of the GENCODE M16 comprehensive gene annotation. There were no errors. I then loaded the tracks into IGV and both displayed just fine, which is another good sign. So you should give it a try with your own GTF.

In case you want to re-run the verification:

git clone https://github.com/billzt/gff3sort.git;
cd gff3sort;
axel -q ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M16/gencode.vM16.chr_patch_hapl_scaff.annotation.gtf.gz;
axel -q ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M16/gencode.vM16.chr_patch_hapl_scaff.annotation.gff3.gz;
unpigz *.gz;
perl ./gff3sort.pl --precise --chr_order natural gencode.vM16.chr_patch_hapl_scaff.annotation.gff3 | bgzip > toto.gff3.gz;
tabix -p gff toto.gff3.gz;
perl ./gff3sort.pl --precise --chr_order natural gencode.vM16.chr_patch_hapl_scaff.annotation.gtf | bgzip > toto.gtf.gz;
tabix -p gff toto.gtf.gz;

IGV check: [image: igv]


I think the main difference between GFF and GTF is the Parent tag: I have it in the GFF but not in the GTF. I downloaded the main annotation files in both formats from GENCODE. Hence, gff3sort.pl would not work properly on the GTF either...
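
A quick way to check which hierarchy attributes a given file carries is to count the lines containing them. A sketch, assuming GFF3-style `Parent=` and GTF-style `transcript_id` attributes; the function name `count_hierarchy_tags` is hypothetical:

```shell
count_hierarchy_tags() {
    # grep -c prints the number of matching lines (0 if none);
    # "|| true" ignores its non-zero exit status when nothing matches.
    parent=$(grep -c 'Parent=' "$1" || true)
    tid=$(grep -c 'transcript_id ' "$1" || true)
    echo "Parent=: $parent, transcript_id: $tid"
}

# Usage: count_hierarchy_tags annotation.gtf
```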

4.9 years ago
ATpoint 86k

awk one-liner:

awk '$1 ~ /^#/ {print $0;next} {print $0 | "sort -k1,1 -k4,4n -k5,5n"}' in.gtf > out_sorted.gtf
6.7 years ago

gff3sort.pl seems to make sure that lines without a "Parent=" attribute come before those with one when chromosome and start position are the same. With standard Unix programs, I think it should go like this:

$ (grep -v "Parent=" sortme.gtf;grep "Parent=" sortme.gtf)| sort -k1,1 -k4,4n -s

EDIT:

Shouldn't we make sure that the 5th column is sorted within these two groups as well? If so, we have to expand the command a little bit:

(grep -v "Parent=" sortme.gff|sort -k1,1 -k4,4n -k5,5n;grep "Parent=" sortme.gff|sort -k1,1 -k4,4n -k5,5n)| sort -k1,1 -k4,4n -s

If more speed is required, we can use GNU parallel.

parallel ::: 'grep -v "Parent=" sortme.gff|sort -k1,1 -k4,4n -k5,5n' 'grep "Parent=" sortme.gff|sort -k1,1 -k4,4n -k5,5n' | sort -k1,1 -k4,4n -s

fin swimmer


Thanks for this. However, I don't have the "Parent=" tag in the gtf file - I downloaded the main annotation file from gencode. Hence, your solution does not work for me ... any other suggestions? thanks!

15 months ago
alejandrogzi ▴ 140

Hi! I recently developed this tool: gtfsort, a chr/pos/feature GTF 2.5-3 sorter that uses a lexicographically-based index ordering algorithm. I benchmarked this tool against the other tools presented in this post, and gtfsort outperformed all of them. It currently accepts only the GTF 2.5 and 3 formats (in the future it will support any given custom format).


Nice work! Your figures, comparisons and descriptions are very nice.


Thanks Juke34! I was impressed by all the capabilities and functionality of AGAT!

4.9 years ago
Juke34 9.0k

This blog talks about it: https://zhiganglu.com/post/sort-gff-topologically/

As I explain here you can use AGAT

The script to use is agat_convert_sp_gxf2gxf.pl (formerly agat_sp_gxf_to_gff3.pl).
You will have to play with the parameter -gvo to get a GTF (BioPerl-formatted) as output.

There is also this script that came up later in AGAT that should do the job directly:
agat_convert_sp_gff2gtf.pl
It can take GFF or GTF input files


Is this still the latest tool to use within your AGAT suite? I am using version v0.4.0 and cannot seem to find agat_sp_gxf_to_gff3.pl at all. On your own website, the script used is different, as shown below: agat_convert_sp_gxf2gxf.pl --gff test.gff.

Since I am using the same version of AGAT as used in your example, I suppose I could simply execute

agat_convert_sp_gxf2gxf.pl --gvi 2 --gvo 2 --gff IN.gff -o OUT.gff. Am I right?


When requesting GTF format from agat_convert_sp_gxf2gxf.pl, it is the BioPerl converter that is used. Currently this converter is not perfect. I plan to fix the problem in BioPerl one day.
The best option is to use agat_convert_sp_gff2gtf.pl. You can find a comparison to other tools here. I suggest you install the latest version of AGAT from the master branch, as there are some fixes lying around. I should update AGAT to v0.4.1.

4.9 years ago
kvshamsudheen ▴ 120

One could use the method explained here. I am just reproducing it below:

$ wget --no-check-certificate https://raw.github.com/ctokheim/PrimerSeq/master/gtf.py -O gtf.py  # get command line script
$ python gtf.py -c your_gtf_file.gtf  # check if GTF is sorted
your_gtf_file.gtf is not correctly sorted. please sort before use.
$ python gtf.py -i your_gtf_file.gtf -o your_gtf_file.sorted.gtf  # GTF was not sorted, so sort it
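
If only the sortedness check is needed, `sort -c` can serve as a rough stand-in. This sketch assumes "sorted" means ordered by sequence name and then numerically by start, which may differ from the criteria gtf.py applies; the function name `is_gtf_sorted` is hypothetical:

```shell
is_gtf_sorted() {
    tab=$(printf '\t')
    # sort -c exits non-zero at the first out-of-order line; comment lines are skipped first.
    # -s restricts the check to the given keys, so records with equal keys never count as disorder.
    grep -v '^#' "$1" | LC_ALL=C sort -c -s -t "$tab" -k1,1 -k4,4n 2>/dev/null
}

# Usage:
# if is_gtf_sorted your_gtf_file.gtf; then echo "sorted"; else echo "not sorted"; fi
```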
  

Just a heads up: this script was last updated in April 2016 as opposed to gff3sort.pl, which was updated in Feb 2019. It is definitely not a sure measure of relevance, but if I had to pick, I'd go for the latest.
