How to make a proteome file
13 months ago

I have the genome FASTA files and genome annotation (GFF) files for a number of species, and I am now trying to create proteome files for these species. I have tried extracting and translating only the CDS sequences, and also only the protein-coding gene sequences, but I am getting incredibly low BUSCO scores (<5%) for these proteomes.

To test whether I have been following the right steps, I took a genome that already has a proteome file available for download from NCBI and tried to recreate that proteome. I have translated several different combinations of gene features to work out how the NCBI proteome file was made, but none of them reproduces it.

What are the steps to creating a proteome file?

proteome genomics • 1.4k views

What do you mean by "a proteome file"? Do you wish to create a file with FASTA-format sequences of all protein isoforms from a specific species?


I think the usual approach is to use gffread to extract the protein sequences from the GFF and the genome sequence.

AGAT can also help. See the answer in this thread: How to get proteins from GFF file resulted from MAKER annotation
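
For reference, here is a minimal sketch of calling gffread so that it writes the translated proteins directly, wrapped in Python. The file names (`genome.fna`, `annotation.gff`, `proteome.faa`) and the helper name are placeholders for illustration, and the call only runs if gffread is on the PATH and the input files exist.

```python
import os
import shutil
import subprocess


def gffread_protein_command(genome_fasta, gff, out_faa):
    """Build a gffread call that writes one translated protein per transcript.

    -g gives the genome FASTA, -y names the output file that receives the
    protein translation of each transcript's CDS.
    """
    return ["gffread", gff, "-g", genome_fasta, "-y", out_faa]


# Placeholder file names; replace with your own paths.
cmd = gffread_protein_command("genome.fna", "annotation.gff", "proteome.faa")
print(" ".join(cmd))

# Only run when gffread is installed and both inputs are present.
inputs_present = all(os.path.exists(p) for p in ("genome.fna", "annotation.gff"))
if shutil.which("gffread") and inputs_present:
    subprocess.run(cmd, check=True)
```

With `-y`, the splicing of CDS segments and the translation are done by gffread itself, so no separate translation step is needed afterwards.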


From what I can tell, gffread can be used to extract nucleotide sequences from the genome FASTA file based on the features described in the GFF file. If so, I have already tried this and extracted sequences for both CDS and protein-coding genes. However, after translating them into a .faa proteome file, my sequences do not match the sequences in the proteome available on NCBI. Am I extracting the correct feature, or am I missing a step?
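
One possible cause of a mismatch like this is translating whole gene features (which still contain introns) or translating each CDS row separately. Below is a minimal sketch of the per-transcript logic, assuming a GFF3 file whose CDS rows carry a single `Parent` attribute and the standard genetic code; the toy genome and the transcript IDs `tx1`/`tx2` are made up for illustration.

```python
from collections import defaultdict

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def parse_attributes(field):
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def proteins_from_cds(genome, gff_lines):
    """Join the CDS segments of each transcript, then translate once."""
    cds = defaultdict(list)
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is not None:
            cds[parent].append((cols[0], int(cols[3]), int(cols[4]), cols[6], cols[7]))
    proteins = {}
    for parent, segs in cds.items():
        segs.sort(key=lambda s: s[1])  # ascending genomic order
        seq = "".join(genome[s[0]][s[1] - 1:s[2]] for s in segs).upper()
        first = segs[0]
        if first[3] == "-":
            # Minus strand: reverse-complement the joined sequence;
            # the 5'-most segment is the one with the highest coordinate.
            seq = reverse_complement(seq)
            first = segs[-1]
        phase = int(first[4]) if first[4] in ("0", "1", "2") else 0
        proteins[parent] = translate(seq[phase:]).rstrip("*")
    return proteins


# Toy data: the same two-exon gene on the plus strand (chr1) and minus strand (chr2).
GENOME = {"chr1": "ATGGCGTAAGTCTAA", "chr2": "TTAGACTTACGCCAT"}
GFF = [
    "chr1\t.\tCDS\t1\t5\t.\t+\t0\tID=cds1;Parent=tx1",
    "chr1\t.\tCDS\t12\t15\t.\t+\t1\tID=cds1;Parent=tx1",
    "chr2\t.\tCDS\t1\t4\t.\t-\t1\tID=cds2;Parent=tx2",
    "chr2\t.\tCDS\t11\t15\t.\t-\t0\tID=cds2;Parent=tx2",
]
PROTEINS = proteins_from_cds(GENOME, GFF)
print(PROTEINS)
```

In the toy data, translating the whole `chr1` region without removing the intron gives a different peptide with a premature stop codon, which is the kind of discrepancy that would also lower BUSCO scores.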


Which genome from NCBI did you try to use as a positive control? Can you post the accession?
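
For a positive control like this, it may help to compare the two proteomes by sequence rather than by header, since identifiers and trailing stop characters often differ between tools. A rough sketch, using made-up records in place of real files:

```python
def read_fasta(lines):
    """Parse FASTA lines into {first word of header: sequence}."""
    records, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def normalize(seq):
    """Upper-case and drop a trailing stop symbol so formats are comparable."""
    return seq.upper().rstrip("*")


def compare_proteomes(mine, reference):
    """Count sequences shared between two proteomes, ignoring identifiers."""
    a = {normalize(s) for s in mine.values()}
    b = {normalize(s) for s in reference.values()}
    return {"shared": len(a & b), "only_mine": len(a - b), "only_reference": len(b - a)}


# Made-up example records; in practice pass open file handles instead.
MINE = read_fasta([">tx1", "MA*", ">tx2", "MKV"])
REFERENCE = read_fasta([">NP_1 hypothetical protein", "ma", ">NP_2", "MKL"])
SUMMARY = compare_proteomes(MINE, REFERENCE)
print(SUMMARY)
```

A high `shared` count with a few leftovers suggests the extraction is right and only edge cases differ; a near-zero count points to a problem in how the features were joined or translated.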


Yes - I mean a .faa file that contains the amino acid sequences for protein-coding genes.


steps to creating a proteome file

I've never done it or anything like it, but here's what my gut says:

  1. Get all NM_/ENST transcript identifiers from the GFF file
  2. Look up protein identifiers (NP_/ENSP) corresponding to these transcript identifiers
  3. Download the sequences for each of these proteins as you get their identifiers
  4. Concatenate these sequences to get the proteome file
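
Steps 1 and 2 above could be sketched as follows, assuming the CDS rows of the GFF carry both a `Parent` (transcript) and a `protein_id` attribute, as RefSeq-style GFF3 files typically do. Other annotation sources may not include `protein_id`, in which case this lookup returns nothing. The example rows and identifiers are invented, and the download step is left out.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def transcript_to_protein(gff_lines):
    """Map each transcript ID to the protein ID listed on its CDS rows."""
    mapping = {}
    for line in gff_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: both attributes are present on CDS rows.
        if "Parent" in attrs and "protein_id" in attrs:
            mapping[attrs["Parent"]] = attrs["protein_id"]
    return mapping


# Invented rows in RefSeq-like style, for illustration only.
EXAMPLE_GFF = [
    "##gff-version 3",
    "chr1\tRefSeq\tmRNA\t1\t900\t.\t+\t.\tID=rna-NM_000001.1;Parent=gene-ABC",
    "chr1\tRefSeq\tCDS\t10\t200\t.\t+\t0\tID=cds-NP_000001.1;Parent=rna-NM_000001.1;protein_id=NP_000001.1",
    "chr1\tRefSeq\tCDS\t400\t800\t.\t+\t1\tID=cds-NP_000001.1;Parent=rna-NM_000001.1;protein_id=NP_000001.1",
    "chr1\tRefSeq\tCDS\t1500\t1800\t.\t-\t0\tID=cds-NP_000002.1;Parent=rna-NM_000002.1;protein_id=NP_000002.1",
]
ID_MAP = transcript_to_protein(EXAMPLE_GFF)
print(ID_MAP)
```

Note that this route only works for species whose proteins already exist as database records; for a newly annotated genome the sequences have to be translated from the CDS features instead.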
