Question

What is the difference between "translated CDS" and "protein"?

1

3.8 years ago

niuzx9581 ▴ 30

Hi!

I am going to download the protein annotation file for the Wolbachia genome from NCBI for orthology analysis. However, there are two amino acid sequence files: translated CDS and Protein FASTA. Which one should I download, and what is the difference between them? Thank you very much for your answer!

sequence gene genome • 3.2k views

ADD COMMENT • link updated 3.8 years ago by Istvan Albert 102k • written 3.8 years ago by niuzx9581 ▴ 30

1

You might have the stop codon represented by a star within the translated CDS if the translation comes from a gff file.
Open both files and compare a sequence they have in common to see what differs between them.

ADD REPLY • link 3.8 years ago by Juke34 8.9k

0

Are you retrieving sequences from GenBank or RefSeq?

ADD REPLY • link 3.8 years ago by Joe 21k

score 1 · Answer 1 · 2021-03-02

Based on some digging, I think this is the correct answer, but anyone with more authoritative knowledge can feel free to correct me.

You can take a look at the READMEs that NCBI include within the directory for the genomes: https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/README.txt

From the text:

   *_protein.faa.gz file
       FASTA format sequences of the accessioned protein products annotated on
       the genome assembly. The FASTA title is formatted as sequence 
       accession.version plus description.

   *_translated_cds.faa.gz
       FASTA sequences of individual CDS features annotated on the genomic 
       records, conceptually translated into protein sequence. The sequence 
       corresponds to the translation of the nucleotide sequence provided in the
       *_cds_from_genomic.fna.gz file.

which relies on

   *_cds_from_genomic.fna.gz
       FASTA format of the nucleotide sequences corresponding to all CDS 
       features annotated on the assembly, based on the genome sequence. See 
       the "Description of files" section below for details of the file format.

The tl;dr appears to be that protein.faa contains proteins which have accessions of their own in the protein database, whereas the CDS sequences and translated CDS sequence files are direct, simple, 'naive' in silico translations from the annotations on the genomes themselves.
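One way to see the difference for yourself is to line the two files up record by record. The following is a minimal sketch, not an NCBI tool: it assumes that each header in the translated CDS file carries a `[protein_id=...]` tag pointing at the accession used in protein.faa, and the accessions and sequences in the toy data are made up for illustration.

```python
import re
from io import StringIO

# Assumption: translated CDS headers contain a tag such as [protein_id=WP_xxx.1]
PROTEIN_ID = re.compile(r"\[protein_id=([^\]]+)\]")


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def compare(protein_handle, translated_cds_handle):
    """Sort translated CDS records into same / different / unmatched."""
    # protein.faa headers start with accession.version
    proteins = {h.split()[0]: s for h, s in read_fasta(protein_handle)}
    report = {"same": [], "different": [], "unmatched": []}
    for header, seq in read_fasta(translated_cds_handle):
        match = PROTEIN_ID.search(header)
        if match is None or match.group(1) not in proteins:
            # No protein accession to compare against (e.g. a pseudogene)
            report["unmatched"].append(header.split()[0])
        elif seq.rstrip("*") == proteins[match.group(1)]:
            # Ignore a trailing stop-codon star when comparing
            report["same"].append(match.group(1))
        else:
            report["different"].append(match.group(1))
    return report


# Toy data: hypothetical accessions and sequences, for illustration only
protein_faa = StringIO(
    ">WP_TOY1.1 hypothetical protein\nMKT\n"
    ">WP_TOY2.1 toy protein\nMAA\n"
)
translated_cds = StringIO(
    ">lcl|TOY_prot_WP_TOY1.1_1 [protein_id=WP_TOY1.1]\nMKT\n"
    ">lcl|TOY_prot_WP_TOY2.1_2 [protein_id=WP_TOY2.1]\nVAA\n"
    ">lcl|TOY_prot_3 [pseudo=true]\nMK\n"
)
report = compare(protein_faa, translated_cds)
print(report)
```

For the real downloads you would pass `gzip.open(path, "rt")` handles instead of the `StringIO` objects, since both files are distributed as `.gz`.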

To answer your original question: it depends on the analysis you intend to do, but in 99% of cases I expect you'll want the protein.faa file.

score 0 · Answer 2 · 2021-03-02

Proteins vs. translations: the bio package to the rescue.

https://www.bioinfo.help/

I still need to find a better way of reporting these, though. Right now it raises a logger error (only visible in verbose mode) when the translation of the DNA does not match the translation given for the protein (this behaviour will probably change in the future).

# Get the Drosophila chromosome 1 (the GenBank file also has translation-ready protein sequences embedded)
bio fetch NT_033779
# Translate each nucleotide CDS as represented on the DNA
bio convert NT_033779  --fasta --type CDS --translate -v > foo

The translations in FASTA format will go into the file foo, but on the screen it will print:

 *** translation mismatch for: NP_001259829.1
 *** translation mismatch for: NP_001137762.2
 *** translation mismatch for: NP_722676.3
 *** translation mismatch for: NP_608564.3
 *** translation mismatch for: NP_001259867.1

and so on for 87 records (out of 5707). The reason is that some translations don't follow the rules ... it is biology after all.
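To make the idea of a "translation mismatch" concrete, here is a minimal sketch of a naive translation check in plain Python. It is not how the bio package works internally; the function names are hypothetical and the sequences are toy examples. It uses the standard genetic code and shows two well-known kinds of exception that a naive translation gets wrong: an alternative start codon (read as methionine at initiation) and an in-frame TGA recoded as selenocysteine (U). Whether these explain the particular records listed above is not something this sketch can tell you.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def naive_translate(cds):
    """Translate codon by codon with the standard table, no special cases."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X")  # X for codons with ambiguous bases
        for i in range(0, usable, 3)
    )


def matches(cds, annotated_protein):
    """True if the naive translation equals the annotated protein sequence."""
    return naive_translate(cds).rstrip("*") == annotated_protein.rstrip("*")


# Toy sequences, for illustration only
print(matches("ATGGCCTAA", "MA"))      # ordinary CDS
print(matches("GTGGCCTAA", "MA"))      # GTG start: naive translation gives V
print(matches("ATGTGAGCCTAA", "MUA"))  # in-frame TGA annotated as selenocysteine
```

The first call agrees with the annotation and the other two do not, which is the same kind of disagreement the verbose log reports.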