Hello, I am referring to an old entry about the gff2fasta.pl script, discussed here and located here. I figured asking a new question would be easier than reviving a 5.5-year-old thread.
Anyway, I was hoping for some advice on how to modify the script so that it correctly parses a somewhat unusual .gff file I was given. Here are two gene entries from the file:
7000000037415267 . gene 21339 22504 . + . ID=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619
7000000037415267 . mRNA 21339 22504 . + . ID=7000003035155526;Parent=7000003035155523;Function=polygalacturonase%2C%20putative;Name=PITG_19619
7000000037415267 . exon 21339 21714 . + . ID=7000003035155526.exon2;Parent=7000003035155526
7000000037415267 . CDS 21339 21714 . + 0 ID=cds.7000003035155526;Parent=7000003035155526
7000000037415267 . exon 21749 22170 . + . ID=7000003035155526.exon3;Parent=7000003035155526
7000000037415267 . CDS 21749 22170 . + 2 ID=cds.7000003035155526;Parent=7000003035155526
7000000037415267 . exon 22307 22504 . + . ID=7000003035155526.exon4;Parent=7000003035155526
7000000037415267 . CDS 22307 22504 . + 0 ID=cds.7000003035155526;Parent=7000003035155526
7000000037414998 . gene 679960 682584 . + . ID=7000003035181604;Function=conserved%20hypothetical%20protein;Name=PITG_09139
7000000037414998 . mRNA 679960 682584 . + . ID=7000003035181607;Parent=7000003035181604;Function=conserved%20hypothetical%20protein;Name=PITG_09139
7000000037414998 . five_prime_UTR 679960 680620 . + . ID=7000003035181607.utr5p1;Parent=7000003035181607
7000000037414998 . five_prime_UTR 680710 680802 . + . ID=7000003035181607.utr5p2;Parent=7000003035181607
7000000037414998 . five_prime_UTR 680907 680909 . + . ID=7000003035181607.utr5p3;Parent=7000003035181607
7000000037414998 . exon 679960 680620 . + . ID=7000003035181607.exon1;Parent=7000003035181607
7000000037414998 . exon 680710 680802 . + . ID=7000003035181607.exon2;Parent=7000003035181607
7000000037414998 . exon 680907 681227 . + . ID=7000003035181607.exon3;Parent=7000003035181607
7000000037414998 . CDS 680910 681227 . + 0 ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998 . exon 681298 681489 . + . ID=7000003035181607.exon4;Parent=7000003035181607
7000000037414998 . CDS 681298 681489 . + 0 ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998 . exon 681563 682584 . + . ID=7000003035181607.exon5;Parent=7000003035181607
7000000037414998 . CDS 681563 682174 . + 0 ID=cds.7000003035181607;Parent=7000003035181607
7000000037414998 . three_prime_UTR 682175 682584 . + . ID=7000003035181607.utr3p1;Parent=7000003035181607
The gff2fasta.pl script uses whatever string follows "ID=" as the entry/gene name. That is, the gene output will be:
>7000003035155523
ATGCCTTTAGCGACGATCACTCTCCTCTTCTTCGCTAGCTTACCTCCCCAATCCACTCTT...
>7000003035181604
GGTGAACATGTTGTCTGTATTGTCTGTACTTGCCGACCATGAGCTCCTCGGTAGTGCACA...
And the output for mRNA, peptides, cds will be:
>7000003035155526
MPLATITLLFFASLPPQSTLHSAICFLPTQRPLKVQPAMKLVSSAFGVFALLAAFVSGST...
>7000003035181607
MSFSKSNLPPTLPVAIKKEREDPSSLSGSMSIPGSSSSIPRKDSIGWGADDFLGMISHTP...
Is there a way to name each header line of the resulting fasta file after the string following "Name=" instead? In these cases, that would be:
>PITG_19619
ATGCCTTTAGCGACGATCACTCTCCTCTTCTTCGCTAGCTTACCTCCCCAATCCACTCTT...
>PITG_09139
GGTGAACATGTTGTCTGTATTGTCTGTACTTGCCGACCATGAGCTCCTCGGTAGTGCACA...
and
>PITG_19619
MPLATITLLFFASLPPQSTLHSAICFLPTQRPLKVQPAMKLVSSAFGVFALLAAFVSGST...
>PITG_09139
MSFSKSNLPPTLPVAIKKEREDPSSLSGSMSIPGSSSSIPRKDSIGWGADDFLGMISHTP...
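One way the renaming could work, sketched in Python rather than Perl since it leaves gff2fasta.pl untouched: build a lookup from each feature's ID to its Name using the .gff attributes, then rewrite the headers of the fasta files the script already produces. The function names id_to_name and rename_headers are made up for illustration, and the sketch assumes the attributes sit in the ninth column as key=value pairs separated by semicolons, as in the entries above.

```python
def id_to_name(gff_lines):
    """Map each feature's ID to its Name, for features that carry both."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue  # skip comment and directive lines
        # Split into at most 9 columns; the 9th keeps the attribute string.
        fields = line.strip().split(None, 8)
        if len(fields) < 9:
            continue  # blank or malformed line
        attrs = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        if "ID" in attrs and "Name" in attrs:
            mapping[attrs["ID"]] = attrs["Name"]
    return mapping


def rename_headers(fasta_lines, mapping):
    """Yield FASTA lines with each header ID replaced by its mapped Name."""
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            # The ID is the first word; anything after it is kept as-is.
            seq_id, _, rest = header.partition(" ")
            new_id = mapping.get(seq_id, seq_id)
            yield ">" + new_id + (" " + rest if rest else "") + "\n"
        else:
            yield line
```

Features without a Name (the exon, CDS and UTR lines in this file) are left out of the lookup, and any header whose ID is not in the lookup is passed through unchanged.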
Alternatively, maybe there is a way to modify the .gff file itself by swapping the string after "ID=" with the string after "Name="? I have only limited Perl knowledge.
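A rough sketch of that second idea, again in Python and under some assumptions. In the entries above the gene and its mRNA share the same Name, so replacing both IDs at once would give two features the same ID and make the mRNA its own Parent. The sketch therefore renames only one feature type per run and updates the Parent attributes that point at it. The function name swap_id_with_name and its feature_type parameter are hypothetical, and the output is written tab-separated.

```python
def swap_id_with_name(gff_lines, feature_type="mRNA"):
    """Return GFF lines where features of one type use their Name as ID."""
    lines = list(gff_lines)

    # Pass 1: collect old ID -> Name for the chosen feature type only.
    renames = {}
    for line in lines:
        fields = line.strip().split(None, 8)
        if line.startswith("#") or len(fields) < 9 or fields[2] != feature_type:
            continue
        attrs = dict(p.split("=", 1) for p in fields[8].split(";") if "=" in p)
        if "ID" in attrs and "Name" in attrs:
            renames[attrs["ID"]] = attrs["Name"]

    # Pass 2: rewrite ID and Parent values that refer to a renamed feature.
    out = []
    for line in lines:
        fields = line.strip().split(None, 8)
        if line.startswith("#") or len(fields) < 9:
            out.append(line)  # comments, blanks, malformed lines: unchanged
            continue
        pairs = []
        for pair in fields[8].split(";"):
            key, sep, value = pair.partition("=")
            if key == "ID":
                value = renames.get(value, value)
            elif key == "Parent":
                # Parent may hold a comma-separated list of IDs.
                value = ",".join(renames.get(v, v) for v in value.split(","))
            pairs.append(key + sep + value)
        fields[8] = ";".join(pairs)
        out.append("\t".join(fields) + "\n")
    return out
```

Running it with feature_type="mRNA" would serve the mRNA, peptide and CDS output, and a separate run with feature_type="gene" would serve the gene output. IDs of child features such as 7000003035155526.exon2 are left alone; only their Parent changes.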
Much appreciated, Mike