Question

Determine NCBI Nucleotide source of .fasta amino acid file

0

5.3 years ago

LRStar ▴ 200

I have a .fasta file with amino acid sequences. The beginning of the file is as follows:

>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
MKKWFLAAAVVACVLMTGCPPRLKKPPPPPNPPPNLKNTPCPKKRPRPSP*KNPSPM*KVARLSGKCILI
LTNTMCVQTCKAQSMKP*KKSKNTV*KYSWRATPMSLVQANIILP*ATNAALV*KMF*LSRASVRTVLKW
*VLEKPNPFARKKLQSATVKTAVLTSKLWT

I am trying to find the source of this file. I believe I obtained it from NCBI Nucleotide (https://www.ncbi.nlm.nih.gov/nuccore/) while searching for the complete genome of a Helicobacter species. Once I found the species, I believe I clicked on "Send to", "Coding sequences", and then "FASTA protein". Then I downloaded that as a .fasta file.

Now, I am trying to determine the exact origin of this .fasta file I have. I am attempting to give the NCBI Nucleotide link to colleagues. Is it possible for me to 'reverse engineer' this type of file and determine where I downloaded it from?

fasta ncbi nucleotide • 1.5k views

ADD COMMENT • link updated 5.3 years ago by GenoMax 151k • written 5.3 years ago by LRStar ▴ 200

0

You could also search NCBI with NC_019674, which will lead you to this genome page. Protein and nucleotide FASTA sequences are available in the top box. Note: these are representative sequences for multiple genomes and are labeled with WP identifiers.

ADD REPLY • link 5.3 years ago by GenoMax 151k

0

5.3 years ago

GenoMax 151k

Looks like you may have retrieved that file using Entrez Direct (EDirect), like so:

$ esearch -db nuccore -query "NC_019674.1" | efetch -format fasta_cds_aa 
>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
MKKWFLAAAVVACVLMTGCPPRLKKPPPPPNPPPNLKNTPCPKKRPRPSP*KNPSPM*KVARLSGKCILI
LTNTMCVQTCKAQSMKP*KKSKNTV*KYSWRATPMSLVQANIILP*ATNAALV*KMF*LSRASVRTVLKW
*VLEKPNPFARKKLQSATVKTAVLTSKLWT
>lcl|NC_019674.1_prot_WP_015105870.1_2 [locus_tag=BN341_RS00010] [protein=TPR repeat containing exported protein; Putative periplasmic protein contains a protein prenylyltransferase domain] [protein_id=WP_015105870.1] [location=457..1398] [gbkey=CDS]
MRFLGLLVGGLLCAEPSAFELQSGATKQELSTLKSSNKNLGDILTALKGQTNGLLQGQEGLRSLVEGQGI
RLKKATDALNAHSDELKALKSTQDAQADLIKQQADLIHTLKTQIQTNQDALANFEKKNQETQQLLENMRA
...................

So if you need to get the nucleotide sequence then you should do the following (sequence truncated to save space):

$ esearch -db nuccore -query "NC_019674.1" | efetch -format fasta_cds_na 
>lcl|NC_019674.1_cds_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]
ATGAAAAAGTGGTTTTTAGCCGCCGCAGTTGTGGCGTGTGTGTTGATGACAGGGTGCCCCCCCAGGCTAA
AGAAGCCACCCCCGCCCCCAAACCCGCCCCCAAACCTGAAGAACACACCGTGCCCAAAGAAGAGGCCCAG
GCCAAGCCCGTAGAAAAACCCAAGCCCCATGTAGAAAGTGGCACGATTGTCGGGCAAGTGTATTTTGATT
......
>lcl|NC_019674.1_cds_WP_015105870.1_2 [locus_tag=BN341_RS00010] [protein=TPR repeat containing exported protein; Putative periplasmic protein contains a protein prenylyltransferase domain] [protein_id=WP_015105870.1] [location=457..1398] [gbkey=CDS]
GTGCGGTTTTTAGGCTTGCTTGTGGGGGGGCTCTTGTGCGCTGAGCCCTCCGCTTTTGAACTGCAAAGTG
GGGCGACCAAGCAAGAGTTAAGTACCCTAAAAAGCAGCAATAAAAACCTAGGTGACATCTTAACCGCGCT
.................
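One way to check whether a file on disk really came from this record is to download it again and compare the record IDs. The following is only a sketch: the function names `fasta_ids` and `same_records` and the file names in the comments are made up for illustration, and since annotations can change between downloads, a mismatch does not by itself rule out the accession.

```shell
# Sketch: check whether two CDS FASTA files list the same records in the same order.

# Print only the record IDs (the first word of each header line).
fasta_ids() {
    grep '^>' "$1" | awk '{print $1}'
}

# Succeed (exit status 0) when both files contain identical ID lists.
same_records() {
    [ "$(fasta_ids "$1")" = "$(fasta_ids "$2")" ]
}

# Example with hypothetical file names:
#   esearch -db nuccore -query "NC_019674.1" | efetch -format fasta_cds_aa > fresh.fasta
#   same_records my_file.fasta fresh.fasta && echo "same records"
```

Only the IDs are compared, not the bracketed annotations or the sequences, so this is a quick sanity check rather than proof of identity.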

ADD COMMENT • link 5.3 years ago by GenoMax 151k

score 2 · Accepted Answer · 2020-01-17

2

5.3 years ago

gb ★ 2.2k

this? https://www.ncbi.nlm.nih.gov/nuccore/NC_019674.1/ or this? https://www.ncbi.nlm.nih.gov/nuccore/NC_019674.1?location=1804546:1804601,1:456

ADD COMMENT • link 5.3 years ago by gb ★ 2.2k

0

Thanks @gb. I believe it is the first one. What was your process for determining that? I have a few other files like this, and I suspect my navigation skills on NCBI Nucleotide are not up to par, because I often cannot reverse engineer where my files came from. (Yes, I plan to take better notes when I create files in the future as well :)

ADD REPLY • link 5.3 years ago by LRStar ▴ 200

1

To be honest, I never would have thought my comment would help you, but I can try to explain. The header (or description) of a sequence is the line starting with ">". In your case:

>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]

The headers of sequences in the NCBI databases are built in a specific way: they are mostly split with "|" characters. The first item in your case is:

>lcl

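That first item tells you what kind of ID the next item is (https://en.wikipedia.org/wiki/FASTA_format#NCBI_identifiers); "lcl" marks a local identifier. The next item is the identifier itself, here NC_019674.1_prot_1, and the part in front of "_prot_" is the accession of the record the protein was taken from: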

NC_019674.1

You can use this "code" to look up the sequence on the NCBI website. In your case, the record contains a lot of information and will not be fully shown on the page by default; you can get more if you click on "Customize view". You can look up such IDs at, for example, https://www.ncbi.nlm.nih.gov/ using the search field at the top of the page.
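The lookup described above can also be sketched in the shell. This assumes the header has the ">lcl|ACCESSION_prot_..." (or "_cds_") shape shown in the question; the function name `accession_from_header` is made up for illustration, and the URL pattern is the one from the first link in the answer.

```shell
# Sketch: derive the NCBI Nucleotide link from a FASTA header line.
# Assumes headers of the form ">lcl|ACCESSION.VERSION_prot_..." or "..._cds_...".

# Print the accession embedded in a header line.
accession_from_header() {
    # Drop the leading ">lcl|", then drop everything from "_prot_" or "_cds_" onward.
    printf '%s\n' "$1" | sed -E 's/^>lcl\|//; s/_(prot|cds)_.*$//'
}

header='>lcl|NC_019674.1_prot_1 [locus_tag=BN341_RS00005] [protein=OmpA family protein] [pseudo=true] [location=join(1804546..1804601,1..456)] [gbkey=CDS]'
acc=$(accession_from_header "$header")
url="https://www.ncbi.nlm.nih.gov/nuccore/${acc}"

echo "$acc"
echo "$url"
```

For a file on disk, the header could be read with something like `header=$(grep -m1 '^>' my_file.fasta)`, where `my_file.fasta` is a placeholder name.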

OK, to add to this comment: this is a very simple and basic explanation. Maybe someone else wants to explain it better or in more detail. For example, there are many more ways to look up those IDs, with scripts, R packages, etc. Even the ID, usually called an accession, is built up in a specific way (https://www.ncbi.nlm.nih.gov/Sequin/acc.html), and those accessions are also connected with taxonomy, BioProjects, and many more things.
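As one small example of the script route, here is a rough sketch for going through several files of unknown origin at once. It assumes every header follows the ">lcl|ACCESSION_prot_..." or ">lcl|ACCESSION_cds_..." pattern from the question; the function name `list_accessions` and the file names are made up for illustration.

```shell
# Sketch: list the distinct source accessions in each FASTA file, with record counts.
# Assumes ">lcl|ACCESSION_prot_..." or ">lcl|ACCESSION_cds_..." headers.
list_accessions() {
    for f in "$@"; do
        grep '^>' "$f" \
            | sed -E 's/^>lcl\|//; s/_(prot|cds)_.*$//' \
            | sort | uniq -c \
            | awk -v file="$f" '{print file "\t" $2 "\t" $1}'
    done
}

# Example with hypothetical file names:
#   list_accessions helicobacter_1.fasta helicobacter_2.fasta
```

Each output line is the file name, an accession, and the number of records carrying it, separated by tabs. A file with more than one accession was probably assembled from several records.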

ADD REPLY • link 5.3 years ago by gb ★ 2.2k