Question

How to make a proteome file

21 months ago

sophiematthews03 • 0

I have FASTA files and genome annotation (GFF) files for a number of species, and I am trying to create proteome files for them. I have tried extracting and translating only the CDS sequences, and also only the protein-coding gene sequences, but I am getting incredibly low BUSCO scores (<5%) for these proteomes.

To test whether I am following the right steps, I took a genome that already has a proteome file available for download from NCBI and tried to recreate that proteome. I have not been able to, even after translating various combinations of gene features to work out how the proteome file was made.

What are the steps to creating a proteome file?

proteome genomics • 2.0k views

updated 21 months ago by GenoMax 153k • written 21 months ago by sophiematthews03 • 0

What do you mean by "a proteome file"? Do you wish to create a file with FASTA-format sequences of all protein isoforms from a specific species?

21 months ago by Ram 45k

I think gffread can do this: it extracts the protein sequences using the GFF and the genome sequence.

AGAT can also help. See the answer in this thread: "How to get proteins from GFF file resulted from MAKER annotation"
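For reference, gffread has a `-y` option that writes the translated CDS of each transcript directly (with `-g` pointing at the genome FASTA). What that kind of extraction involves can be sketched in plain Python, which may also help spot where a manual attempt goes wrong: CDS segments are joined per parent transcript before translating, minus-strand transcripts are reverse-complemented, and the phase of the first segment is respected. This is only a rough sketch under stated assumptions, not how gffread or AGAT are implemented. The function names, the tiny example genome, and the `rna1` identifier are made up for illustration, and it assumes GFF3 with a single `Parent` per CDS and the standard genetic code.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code (NCBI translation table 1), codons in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def translate(cds):
    """Translate a coding sequence; unknown or ambiguous codons become 'X'."""
    cds = cds.upper()
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X")
                   for i in range(0, len(cds) - 2, 3))


def proteins_from_gff(gff_lines, genome):
    """Return {parent_id: protein} from GFF3 lines and a {seqid: sequence} dict."""
    segments = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        seqid, start, end, strand, phase = cols[0], cols[3], cols[4], cols[6], cols[7]
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        segments[(attrs.get("Parent"), seqid, strand)].append((int(start), int(end), phase))

    proteins = {}
    for (parent, seqid, strand), segs in segments.items():
        segs.sort()  # genomic order
        # Join all CDS segments of one transcript BEFORE translating.
        seq = "".join(genome[seqid][s - 1:e] for s, e, _ in segs)
        first_phase = segs[0][2]
        if strand == "-":
            seq = revcomp(seq)
            first_phase = segs[-1][2]  # 5'-most segment on the minus strand
        offset = int(first_phase) if first_phase in ("0", "1", "2") else 0
        proteins[parent] = translate(seq[offset:]).rstrip("*")
    return proteins


# Made-up example: two CDS segments separated by a 6 bp intron.
genome = {"chr1": "AAATGGCCGTAAGTAAATGACC"}
gff = [
    "chr1\tx\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=rna1",
    "chr1\tx\tCDS\t15\t20\t.\t+\t0\tID=cds1;Parent=rna1",
]
print(proteins_from_gff(gff, genome))  # {'rna1': 'MAK'}
```

Translating each CDS segment on its own, or translating the whole gene span including introns, will generally not reproduce the proteins in a reference .faa file.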

21 months ago by GenoMax 153k

From what I can tell, gffread can be used to extract nucleotide sequences from the genome FASTA file based on the features described in the GFF file. If so, I have already tried this and extracted sequences for both CDS and protein-coding genes. However, after translation into a .faa proteome file, my sequences do not match the sequences in the proteome available on NCBI. Am I extracting the correct feature, or am I missing a step?
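One way to put a number on "do not match" is to compare the two .faa files by sequence content rather than by header, since the identifiers in a recreated file will rarely be the same as NCBI's. The following is a minimal sketch of that comparison; the helper names and the toy records are hypothetical, and trailing stop symbols (`*`) are stripped so they do not cause spurious mismatches.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {first word of header: sequence}."""
    records, name = {}, None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    # Upper-case and drop a trailing stop symbol before comparing.
    return {n: "".join(parts).upper().rstrip("*") for n, parts in records.items()}


def compare_proteomes(mine, reference):
    """Count distinct sequences shared by, or unique to, each proteome."""
    mine_set, ref_set = set(mine.values()), set(reference.values())
    return {
        "shared": len(mine_set & ref_set),
        "only_mine": len(mine_set - ref_set),
        "only_reference": len(ref_set - mine_set),
    }


# Toy records standing in for the two .faa files.
mine = read_fasta(io.StringIO(">a\nMAK*\n>b\nMTT\n"))
reference = read_fasta(io.StringIO(">p1 example protein\nMAK\n>p2\nMGG\n"))
print(compare_proteomes(mine, reference))
```

With real files, `open("my_proteome.faa")` and `open("reference.faa")` (placeholder file names) would replace the `io.StringIO` objects.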

21 months ago by sophiematthews03 • 0

Which genome from NCBI did you try to use as a positive control? Can you post the accession?

21 months ago by GenoMax 153k

Yes - I mean a .faa file that contains the amino acid sequences for protein-coding genes.

21 months ago by sophiematthews03 • 0

steps to creating a proteome file

I've never done it or anything like it, but here's what my gut says:

1. Get all NM_/ENST transcript identifiers from the GFF file
2. Look up protein identifiers (NP_/ENSP) corresponding to these transcript identifiers
3. Download the sequences for each of these proteins as you get their identifiers
4. Concatenate these sequences to get the proteome file
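The first two steps above could be sketched roughly as follows. This assumes the GFF3 records each CDS with both a `Parent` attribute (the transcript) and a `protein_id` attribute, which is how NCBI and Ensembl GFF3 files usually look, but check your own files. The function name and the identifiers in the example are placeholders, not real accessions.

```python
def transcript_to_protein_ids(gff_lines):
    """Map each transcript (CDS Parent) to the protein_id on its CDS rows."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 has 9 tab-separated columns; only CDS rows carry protein_id.
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        if "Parent" in attrs and "protein_id" in attrs:
            mapping[attrs["Parent"]] = attrs["protein_id"]
    return mapping


# Placeholder identifiers, for illustration only.
example = [
    "##gff-version 3",
    "chr1\tsrc\tmRNA\t1\t900\t.\t+\t.\tID=rna-TX_EXAMPLE.1;Parent=gene-G1",
    "chr1\tsrc\tCDS\t1\t300\t.\t+\t0\tID=cds-1;Parent=rna-TX_EXAMPLE.1;protein_id=PROT_EXAMPLE.1",
    "chr1\tsrc\tCDS\t601\t900\t.\t+\t0\tID=cds-1;Parent=rna-TX_EXAMPLE.1;protein_id=PROT_EXAMPLE.1",
]
print(transcript_to_protein_ids(example))
```

The resulting protein identifiers would then be used for the download and concatenation steps, which are not shown here.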

21 months ago by Ram 45k