I have the FASTA files and genome annotation (GFF) files for a number of species, and I am now trying to create proteome files for these species. I have tried extracting and translating only CDS sequences, and also only protein-coding gene sequences; however, I am getting incredibly low BUSCO scores (<5%) for these proteomes.
To test whether I have been following the right steps to create a proteome file, I took a genome that already has a proteome file available for download from NCBI and tried to recreate that proteome. I have translated various combinations of gene features to figure out how the proteome file was made, but I have not been able to reproduce it.
What are the steps to creating a proteome file?
What do you mean by "a proteome file"? Do you wish to create a file with FASTA-format sequences of all protein isoforms from a specific species?
I think it is done using gffread to extract the proteins from the GFF and the genome sequence. AGAT can also help. See the answer in this post: How to get proteins from GFF file resulted from MAKER annotation
From what I can tell, gffread can be used to extract nucleotide sequences from the genome FASTA file based on the features described in the GFF file. If so, I have already tried this and extracted sequences for both CDS and protein-coding genes. However, after translation into a .faa proteome file, the sequences that I have do not match the sequences from the proteome available on NCBI. I am wondering whether I am extracting the correct feature, or whether I am missing a step.
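One step that is easy to miss: a protein is normally translated from all CDS segments of one transcript joined together (and reverse-complemented for minus-strand genes), not from each CDS feature on its own and not from the whole gene span, which still contains introns and UTRs. Below is a minimal Python sketch of that idea, not the actual implementation of gffread or AGAT. It assumes GFF3 with a `Parent` attribute on CDS lines, the standard genetic code, and complete CDS (the phase column is ignored). The function names and the toy data are made up for illustration.

```python
from collections import defaultdict

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X for ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def proteins_from_gff(gff_lines, genome):
    """Join the CDS segments of each transcript and translate them.

    genome: dict mapping sequence name -> sequence string.
    Returns a dict mapping the CDS Parent (transcript) ID -> protein.
    """
    segments = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                segments[parent].append(
                    (cols[0], int(cols[3]), int(cols[4]), cols[6])
                )

    proteins = {}
    for parent, segs in segments.items():
        segs.sort(key=lambda s: s[1])  # ascending genomic start
        # GFF coordinates are 1-based and inclusive.
        cds = "".join(genome[seqid][start - 1:end] for seqid, start, end, _ in segs)
        if segs[0][3] == "-":
            cds = reverse_complement(cds)
        proteins[parent] = translate(cds)
    return proteins


# Toy example: one transcript with two CDS segments split by an intron.
toy_genome = {"chr1": "CC" + "ATGGCC" + "GTAAGT" + "AAATAA" + "GG"}
toy_gff = [
    "chr1\ttoy\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\ttoy\tCDS\t15\t20\t.\t+\t0\tID=cds1;Parent=tx1",
]
print(proteins_from_gff(toy_gff, toy_genome))
```

If a positive-control proteome still does not match after joining CDS per transcript, things worth checking are the phase of partial CDS, a non-standard genetic code (for example in organelles), and whether the reference file contains every isoform or only one per gene.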
Which genome from NCBI did you try to use as a positive control? Can you post the accession?
Yes - I mean a .faa file that contains the amino acid sequences for protein-coding genes.
I've never done it or anything like it, but here's what my gut says:
- Get the NM_/ENST transcript identifiers from the GFF file
- Get the protein identifiers (NP_/ENSP) corresponding to these transcript identifiers
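The identifier lookup described here can be sketched in Python as follows. This assumes a GFF3 file in which CDS features carry a `protein_id` attribute and a `Parent` attribute pointing at the transcript, which is how NCBI RefSeq annotation files are commonly laid out; other sources may name these attributes differently. The function names and the accession-like IDs in the example are placeholders, not real records.

```python
def parse_attributes(field):
    """Parse the ninth GFF3 column (key=value pairs separated by ';')."""
    return dict(kv.split("=", 1) for kv in field.strip().split(";") if "=" in kv)


def transcript_to_protein_ids(gff_lines):
    """Map each transcript (CDS Parent) to the protein_id on its CDS lines."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        protein_id = attrs.get("protein_id")
        if not protein_id:
            continue
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                mapping[parent] = protein_id
    return mapping


# Placeholder IDs for illustration only.
example_gff = [
    "##gff-version 3",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=rna-NM_000000.0;Parent=gene-X",
    "chr1\tsrc\tCDS\t10\t40\t.\t+\t0\tID=cds-NP_000000.0;Parent=rna-NM_000000.0;protein_id=NP_000000.0",
    "chr1\tsrc\tCDS\t60\t90\t.\t+\t2\tID=cds-NP_000000.0;Parent=rna-NM_000000.0;protein_id=NP_000000.0",
]
print(transcript_to_protein_ids(example_gff))
```

With such a mapping, the matching records could be pulled from an existing protein FASTA by ID rather than re-translated, provided a protein file for that annotation is available.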