Here is what I did for running BUSCO v3
python ~/bin/busco/scripts/run_BUSCO.py -i genome.fasta \
-o busco_test -l /eukaryota_odb9/ --mode genome --long
So here is what I did in order to convert gff files produced by augustus into a format that could be read by SNAP. Note that I used bedtools merge because fathom complained that some of the exons were overlapping. I am sure someone could write the code more efficiently and elegantly, but it suffices for the ~250 files I am working with.
cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples
while read i;do
file=$i
grep "exon" $i.gff | sort -k 1,1 -k4,4n | ~/bin/bedtools-2.26.0/bin/bedtools merge |\
awk -v OFS='\t' -v b="$file" '
{if ($7 == "-" && NR == 1)
{a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
{if ($7 == "-" && NR>1)
{a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
END {if ($7 == "-")
{a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
{if ($7 == "+" && NR==1)
{a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
{if ($7 == "+" && NR>1)
{a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
END {if ($7 == "+")
{a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' >> busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm
Here's an explanation of the code
cd /directory_with_augustus_gffs/
ls *.gff | perl -pe "s/.gff//g" > samples # get all files with the extension .gff, remove the extension with perl and then put them in a file called samples
while read i;do # for each line in samples, read in "i" or the sample name and do the following
file=$i # create the variable file and assign "i" or the current sample name to it
grep "exon" $i.gff | \ # get only the rows with exon in them
sort -k 1,1 -k4,4n | \ # sort the file by chromosome then start position just in case
~/bin/bedtools-2.26.0/bin/bedtools merge |\ # merge overlapping exons if any
awk -v OFS='\t' -v b="$file" ' # initiate awk, make output field separator tabs, and assign the variable $file (the current sample) to the variable b
{if ($7 == "-" && NR == 1)
{a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the first line, then rename "exon" to "Terminal"
{if ($7 == "-" && NR>1)
{a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is not the first line, then rename "exon" to "Internal"
END {if ($7 == "-")
{a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the minus strand and it is the last line, then rename "exon" to "First"
{if ($7 == "+" && NR==1)
{a="First";print $1,$2,a,$4,$5,$6,$7,$8,b;}} # if the sequence is on the plus strand and it is the first line, then rename "exon" to "First"
{if ($7 == "+" && NR>1)
{a="Internal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}
END {if ($7 == "+")
{a="Terminal";print $1,$2,a,$4,$5,$6,$7,$8,b;}}' \ # if the sequence is on the minus strand and it is the last line, then rename "exon" to "Terminal"
>> busco_augustus_all.gff # append the output for each iteration to the file busco_augustus_all.gff
done < samples
#
~/bin/maker/bin/cegma2zff busco_augustus_all.gff genome.fasta
~/bin/maker/exe/snap/fathom genome.ann genome.dna -categorize 1000
~/bin/maker/exe/snap/fathom -export 1000 -plus uni.ann uni.dna
~/bin/maker/exe/snap/forge export.ann export.dna
~/bin/maker/exe/snap/hmm-assembler.pl drom . > ../drom.buscosnap.hmm
You should also see the following maker-devel post:
https://groups.google.com/forum/#!topic/maker-devel/vp8R06VVQGQ
This code cannot act as you say. Because gff3 output from BUSCO does not contain 'exon'. And 'bedtools merge' command wipe out several gff3 information like belows.
So, output using your code is empty. Some fix will be needed.
I have the same problem.
I would like convert my busco file to CEGMA file or to zff.
You are right that there is problem in the code. In the Augustus GFF3 files that I tested the code on, there was a field containing
exon
. I need to figure out what version of Augustus and BUSCO I was using because I can't seem to findexon
in the newer GFF3 files from BUSCO v.3.0.1 and Augustus 3.3.2. I fixed thebedtools merge
andawk
commands. I must have done testing withoutbedtools merge
. Note that I haven't tested the newer code with the downstream toolscegma2zff
andfathom
forge
andhmm-assembler.pl
edit
: I was using BUSCO v.3.0.1 and Augustus 3.2.3 to generate GFF3 files withexon
in the column 3,type
edit2:
I've tested the above withcegma2zff
, but it doesn't seem to work withfathom
(at least with SNAP version 2017-03-01).