I'm having trouble modifying a GFF file. I think the Linux command line may be the best way to do this, as I'm having a headache getting a script to work correctly. What I would like to do is:
1) Remove any text before or after the number in the Name attribute, i.e. everything between the "=" and the number, and between the number and the ";". For example, Name=mRNAHGSG_07981gene should become Name=07981.
2) For each gene and its corresponding mRNA with the same name beginning with "fgene", "augus", "genema" or "snap", replace the name with a sequential number starting from 20000. The first such gene and mRNA pair would both get Name=20000, the next pair Name=20001, and so on.
Please note: I have sorted the file by position, so each gene and its mRNA should be on consecutive lines, followed by a number of CDS lines.
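Step 1 on its own can be sketched in Python with a single regular expression. This is a minimal sketch, not a tested solution for the whole file: the name `strip_name` is made up for illustration, and it assumes the Name value contains exactly one run of digits that should be kept. It does not do the renumbering from step 2.

```python
import re

# "Name=", then any non-digit prefix, the digits to keep, then any suffix up to ";"
NAME_RE = re.compile(r"\bName=[^0-9;]*([0-9]+)[^;]*")

def strip_name(attributes: str) -> str:
    """Reduce Name=<prefix><digits><suffix> to Name=<digits>.

    Other attributes (ID, Parent, ...) are left untouched, and a column
    without a Name attribute is returned unchanged.
    """
    return NAME_RE.sub(r"Name=\1", attributes, count=1)

# Attribute column taken from the first gene line of the example
print(strip_name("Name=mRNAHGSG_07981gene;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene"))
```

The pattern only anchors on "Name=", so it works whether Name is the first attribute in column 9 or not.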
I have this:
Chromosome_2_Copy maker gene 60155 61282 . - . Name=mRNAHGSG_07981gene;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene
Chromosome_2_Copy maker mRNA 60155 61282 100.0 - . Name=mRNAHGSG_07981;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1;Parent=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene
Chromosome_2_Copy maker CDS 60743 60970 . - . ID=maker-Chromosome_2Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1
Chromosome_2_Copy maker CDS 61019 61282 . - . ID=maker-Chromosome_2Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1
Chromosome_2_Copy maker CDS 62547 63546 . - . ID=augustus_masked-Chs_masked-Chromosome_2-processed-gene-0.14-mRNA-1
Chromosome_2_Copy maker gene 65607 66745 . + . Name=fgenesh_075_N;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker mRNA 65607 66745 . + . Name=fgenesh_075_N;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker CDS 65775 65836 . + . ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1:cds;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1
UPDATE:
This is a script I was playing with. Before running it, I had already removed the first part of the Name (the text before the number) on the command line. I still have to work out how I did that, and the script is not great.
#!/usr/bin/perl
#BUG AFTER THE RENAMING FOR SOME REASON THE CDS Name when loaded in geneious becomes "CDS;CDS" but I can't see that I have added anything and the gff file doesn't seem to have changed this part??
use warnings;
my $gffFile = shift;
open (GFF, "<$gffFile") or die "Could not open file '$gffFile' $!";
my $filename = "Chrom1.gff";
open FILE, ">>$filename" or die "Could not open file '$filename' $!";
#counter for new gene names
my $ii=20000;
while (<GFF>) {
#skip header line
if ($. < 3) {
print FILE $_;
}else{
#remove return character
chomp;
#split columns by tabs
my @col = split /\t/;
#split the last column with the Name attribute in
my @lastcolumn = split /;/, $col[8];
my @Name;
#ALREADY REMOVED THE STARTING mRNAHGSG_ at command line FORGOTTEN COMMAND THOUGH :(
if($lastcolumn[0] =~ /^Name=[0-9]+gene/){
#get just Name=number
@Name = split ("gen",$lastcolumn[0]);
##rename gene names for some
# if (($lastcolumn[0] =~ /^Name.augu.*/) || ($lastcolumn[0] =~ /^Name.genema.*/) || ($lastcolumn[0] =~ /^Name.fgene.*/) || ($lastcolumn[0] =~ /^Name.snap.*/)){
print FILE "$col[0]" . "\t" . "$col[1]" . "\t" . "$col[2]" . "\t" . "$col[3]" . "\t" . "$col[4]" . "\t" . "$col[5]" . "\t" . "$col[6]" . "\t" . "$col[7]" . "\t" . $Name[0] . ";";
my $nums = @lastcolumn;
for (my $i=1; $i < $nums; $i++) {
if ($i+1<$nums) {
print FILE "$lastcolumn[$i]" . ";";
}else{
print FILE "$lastcolumn[$i]" . "\n";
}
}
}else{
print FILE $_ . "\n";
# if($data[0] !~ /^Name=[0-9]+gene/){
#
# print $data[1] . "\t" . $data[0] . "\n";
# }
}
}
}
close FILE;
UPDATE2:
The field delimiter seems fine. It only fails for the case I added: on lines with _M in the name, the first section of the attribute column gets replaced by an ID= string.
I have this as input:
Chromosome_2_Copy maker gene 60155 61282 . - . Name=mRNAHGSG_07981gene;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene
Chromosome_2_Copy maker mRNA 60155 61282 100.0 - . Name=mRNAHGSG_07981;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1;Parent=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene
Chromosome_2_Copy maker CDS 60743 60970 . - . ID=maker-Chromosome_2Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1
Chromosome_2_Copy maker CDS 61019 61282 . - . ID=maker-Chromosome_2Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1
Chromosome_2_Copy maker CDS 62547 63546 . - . ID=augustus_masked-Chs_masked-Chromosome_2-processed-gene-0.14-mRNA-1
Chromosome_2_Copy maker gene 65607 66745 . + . Name=fgenesh_075_N;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker mRNA 65607 66745 . + . Name=fgenesh_075_N;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker CDS 65775 65836 . + . ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1:cds;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1
Chromosome_2_Copy maker gene 65707 66845 . + . Name=12345_M;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker mRNA 65707 66845 . + . Name=12345_M;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75
I'm using the commands below to first sort the file and then make the changes, but I need to add something so that nothing is changed if the name contains _M. Any ideas how I can add that?
sort -t $'\t' -k 4,4 -V Chromosome_1.gff -o Chromosome_1_G.gff
cat Chromosome_1_G.gff | awk 'BEGIN{FS="\t";OFS="\t";id=19999} {if ($3 ~ /gene/ || $3 ~ /mRNA/) {split($9,a,";"); split(a[1],b,"="); if (b[2] ~ /^fgene/ || b[2] ~ /^augus/ || b[2] ~ /^genema/ || b[2] ~ /^snap/){if($3 ~ /gene/){id=id+1}; ss="\\1"id"\\3";} else {ss="\\1\\2\\3"}; s=gensub(/([[:alpha:]]*=)[^[:digit:];]*([[:digit:]]+)[^[:digit:];]*(;.+)/,ss, "g", $9); print $1,$2,$3,$4,$5,$6,$7,$8,s } else {print} }' > Chromosome_1.gff
Example output that I would like:
Chromosome_2_Copy maker gene 60155 61282 . - . Name=07981;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene
Chromosome_2_Copy maker mRNA 60155 61282 100.0 - . Name=07981;ID=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1;Parent=maker-Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1-gene
Chromosome_2_Copy maker CDS 60743 60970 . - . ID=maker-Chromosome_2Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1
Chromosome_2_Copy maker CDS 61019 61282 . - . ID=maker-Chromosome_2Chromosome_2-exonerate_est2genome-gene-0.53-mRNA-1
Chromosome_2_Copy maker CDS 62547 63546 . - . ID=augustus_masked-Chs_masked-Chromosome_2-processed-gene-0.14-mRNA-1
Chromosome_2_Copy maker gene 65607 66745 . + . Name=20000;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker mRNA 65607 66745 . + . Name=20000;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker CDS 65775 65836 . + . ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1:cds;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1
Chromosome_2_Copy maker gene 65707 66845 . + . Name=12345_M;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75
Chromosome_2_Copy maker mRNA 65707 66845 . + . Name=12345_M;ID=fgenesh_masked-Chromosome_2-processed-gene-0.75-mRNA-1;Parent=fgenesh_masked-Chromosome_2-processed-gene-0.75
While I'm sure you could put together a very long one-liner, you might as well just write a small perl or python script. If the one you've been working on isn't working, then post it and just say what's not working.
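For what it's worth, such a small Python script might look roughly like the sketch below. It is only an illustration under some assumptions that are not confirmed by the question: the file is tab-separated with the attributes in column 9, a gene and its mRNA carry exactly the same original Name (the number is looked up by that name rather than by line order), and any name containing "_M" should be left alone. The function name `rewrite_gff` is made up.

```python
import re

# Name prefixes from the question that should be replaced by a running number
RENUMBER_PREFIXES = ("fgene", "augus", "genema", "snap")
DIGITS_RE = re.compile(r"[0-9]+")

def rewrite_gff(lines, start_id=20000):
    """Yield GFF lines with the Name attribute of gene/mRNA rows rewritten."""
    assigned = {}        # original name -> new number (assumes names are unique per gene)
    next_id = start_id
    for line in lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        # Pass headers, malformed rows and non-gene/mRNA features through unchanged
        if line.startswith("#") or len(cols) < 9 or cols[2] not in ("gene", "mRNA"):
            yield line
            continue
        attrs = cols[8].split(";")
        for i, attr in enumerate(attrs):
            key, _, value = attr.partition("=")
            if key != "Name":
                continue
            if "_M" in value:
                continue  # assumption: names containing _M are never modified
            if value.startswith(RENUMBER_PREFIXES):
                # Gene and mRNA share the original name, so they get the same number
                if value not in assigned:
                    assigned[value] = next_id
                    next_id += 1
                attrs[i] = f"Name={assigned[value]}"
            else:
                # Keep only the first run of digits, e.g. mRNAHGSG_07981gene -> 07981
                match = DIGITS_RE.search(value)
                if match:
                    attrs[i] = f"Name={match.group()}"
        cols[8] = ";".join(attrs)
        yield "\t".join(cols)
```

To use it, you would open the sorted file, loop over `rewrite_gff(handle)` and write each returned line to a new file. If the same predicted name can occur for two different genes, the dictionary lookup would give them the same number, so in that case it would be safer to bump the counter on every gene line instead.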
I updated the question, but I think I created more problems.
Please add an example of the result you want, so it is easy to understand.
In addition, there is no name beginning with "fgene" or "augus" or "genema" or "snap" in "the first gene and mRNA example". Do you mean
or
?
I believe a one-liner in awk or Perl would tackle this. I agree with @DevonRyan.