I am trying to extract annotated fasta sequences from a genome file, using a gff file and gffread
(version 0.9.8-0). The code that I found is simply gffread -w test.transcripts.fa -g genome.fna test.gff
and this works great on my local mac. Some example output can be seen below:
> gene-TM_RS04265 gene=gyrB CDS=1-1911
ATGGAAAAGTACTCCGCTGAAAGTATAAAGGTTTTAAAAGGACTGGAACCTGTCCGAATGCGACCCGGAA
TGTACATAGGATCCACGGGCAAACGTGGATTGCATCACCTCGTGTACGAAGTGGTTGACAACAGTGTTGA
TGAGGCACTCGCTGGATACTGCGACTGGATACGTGTGACTCTCCATGAAGATGGAAGTGTGGAAGTCGAG
GACAACGGAAGGGGAATCCCCGTTGACATACATCCAGAAGAGGGAAGAAGCGCTCTGGAAGTGGTTTTCA
CAGTTCTTCATGCGGGCGGAAAATTCTCCAAGGATTCCTACAAGATAAGCGGTGGCCTGCACGGTGTTGG
TGTATCGGTGGTGAACGCTCTTTCGGAGTGGCTGGAAGTGCGGGTACATCGTGACGGGAAGATCTACAGA
CAAAGATACGAAAGGGGTAAACCGGTCACACCTGTGGAAGTGATAGGAGAAACCGATAAGCACGGCACGA
TCGTTCGATTCAAACCCGATCCTCTCATATTTTCGGAGACAGAGTTCGATCCCGACATACTCGAACACAG
ATTGAGAGAGA...
Where the gene id is TM_RS04265, gene name is GyrB and the CDS is 1-1911.
My issue comes when I use an updated version of gffread (version 0.12.1) on our local server. I can run the same command with the same files (gffread -w test.transcripts.fa -g genome.fna test.gff
) but now I do not get the gene name in the output (as seen below).
> gene-TM_RS04265 CDS=1-1911
ATGGAAAAGTACTCCGCTGAAAGTATAAAGGTTTTAAAAGGACTGGAACCTGTCCGAATGCGACCCGGAA
TGTACATAGGATCCACGGGCAAACGTGGATTGCATCACCTCGTGTACGAAGTGGTTGACAACAGTGTTGA
TGAGGCACTCGCTGGATACTGCGACTGGATACGTGTGACTCTCCATGAAGATGGAAGTGTGGAAGTCGAG
GACAACGGAAGGGGAATCCCCGTTGACATACATCCAGAAGAGGGAAGAAGCGCTCTGGAAGTGGTTTTCA
CAGTTCTTCATGCGGGCGGAAAATTCTCCAAGGATTCCTACAAGATAAGCGGTGGCCTGCACGGTGTTGG
TGTATCGGTGGTGAACGCTCTTTCGGAGTGGCTGGAAGTGCGGGTACATCGTGACGGGAAGATCTACAGA
CAAAGATACGAAAGGGGTAAACCGGTCACACCTGTGGAAGTGATAGGAGAAACCGATAAGCACGGCACGA
TCGTTCGATTCAAACCCGATCCTCTCATATTTTCGGAGACAGAGTTCGATCCCGACATACTCGAACACAG
ATTGAGAGAGA...
As you can see we only get the gene ID and the CDS range but not the gene name.
I know this is probably a simple fix but I can not find anywhere in the gffread manual where I can add a flag to extract the gene name, along with the gene ID and CDS. So if anyone knows how to get gffread -w
to give an output that includes the gene name, that would be greatly appreciated.
If this feature is indeed missing from new version you may want to create an issue on their dev site and request that it be put back.