Hi everyone! I have a FASTA file of amino acid sequences whose headers contain only the RefSeq accession number (e.g. WP_ + 9 digits), and I'm trying to get the protein names so that I can add them to the FASTA headers. Here is an example:
>WP_051684486.1
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVR
ADQDVPEYNDLEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGR
LGGYYYTDFRSNKKIKLDRHNAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLA
MLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY
ADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVFDNLREIPQILIDSVIVPIID
ISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQDRKIQGLAIS
VYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKL
IVDHLSGKIEVKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFE
KAVSGSVSYDDIDVERIDPRSLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAA
EKAGIAEDIRNMPMKMKTLIPQGGGGISGGQRQRIMIARALAAKPNILIFDEATSALDNI
TQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRIIEEGNYQELLKKGGFFANLI
KRQQL
On the NCBI RefSeq site, this maps to "ATP-binding cassette domain-containing protein", so I want to add that to the identifier in order to get:
>WP_051684486.1|ATP-binding cassette domain-containing protein
MSIFGEQFLARRNRDQIDLDNALQDVYEAVTGRESIRYSINSDEQVRKELERICFYLGVR
ADQDVPEYNDLEDMLDYITRPFAIMRRHILLTHHWWKNGDGPLLVSKKDSDELLALIPGR
LGGYYYTDFRSNKKIKLDRHNAGEFEKEAICFYKPLPLSSLSANELTGLLFKNMAAADLA
MLVLSGIGIVGVSLLIPFATKMVFEYVIPTGAMTLVGSFSFLLISSAMVAYIIAVIKQGY
ADRVKVRMEVYLTHGVMGRMINFPTSFFASKSTGELYRVFDNLREIPQILIDSVIVPIID
ISLAMLFIIQIAVIVPELLVPAVITVLLQFVCMAIGTFQAYGLLNIELQQDRKIQGLAIS
VYEGIQRIKLSGSESRIMAKWAGLYSKKAKVAYPAVFPVRFQTEMIAFISMMGMLAAFYK
GFTDNISISQFVAFVAAFGMLTGSITAFSNKSKDVIKLKPVLKMSDEILKECPEVSKEKL
IVDHLSGKIEVKDLTFRYGRDLPLILDGVSFTVHPGEYVAIVGKSGCGKSTLVRIFMGFE
KAVSGSVSYDDIDVERIDPRSLRRSIGVVMQSGNLFYDSIYRNIAISAPGLSMEEAWEAA
EKAGIAEDIRNMPMKMKTLIPQGGGGISGGQRQRIMIARALAAKPNILIFDEATSALDNI
TQKVVQDSLDQLNCTRIVIAHRLSTIQNCDRILVLDKGRIIEEGNYQELLKKGGFFANLI
KRQQL
How would I go about this?
I haven't used RefSeq before. Is there a way to get all RefSeq definitions as a file? If so, you can use some basic text processing in Unix to map them onto the headers in your .fa file.
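If such a definitions file were available, the mapping step could look roughly like the sketch below. It is written in Python rather than with Unix tools, and it assumes a hypothetical tab-separated file with one "accession<TAB>protein name" pair per line; the function names load_names and relabel_fasta are made up for illustration.

```python
def load_names(tsv_lines):
    """Build {accession: protein name} from 'accession<TAB>name' lines.

    The two-column TSV layout is an assumption, not an NCBI format.
    """
    names = {}
    for line in tsv_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        accession, _, name = line.partition("\t")
        names[accession] = name
    return names


def relabel_fasta(fasta_lines, names):
    """Yield FASTA lines, rewriting '>ACC' as '>ACC|name' when ACC is known."""
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            if accession in names:
                yield f">{accession}|{names[accession]}\n"
                continue
        # Sequence lines and headers without a known name pass through unchanged.
        yield line


if __name__ == "__main__":
    import sys

    # Usage (file names are placeholders): python relabel.py names.tsv proteins.fa > out.fa
    with open(sys.argv[1]) as tsv, open(sys.argv[2]) as fasta:
        sys.stdout.writelines(relabel_fasta(fasta, load_names(tsv)))
```

Headers whose accession is missing from the table are left as they are, so unannotated entries are easy to spot afterwards.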
I know that every RefSeq accession has an Identical Protein Groups page on NCBI (in the case above it's https://www.ncbi.nlm.nih.gov/ipg/WP_051684486.1) where I can see the protein's annotation and download a CSV/FASTA file with the annotated sequence, but I honestly don't know if there is a way to get all the RefSeq definitions as a file.
Are you dealing only with C. aminophilum? Also: https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete
There are protein sequences from all types of prokaryotic organisms. I have tried querying IPG myself using the search string my colleagues used, and I got the fully annotated sequences I expected, so I guess it was an error or some kind of preprocessing on their part!
You can use the Entrez E-utilities to do that. Without an API key you are limited to 3 requests per second (10 per second with one).
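A rough sketch of the E-utilities route, using only the Python standard library: it asks efetch for FASTA records from the protein database in batches and reads the names off the returned headers. It assumes the headers look like ">ACCESSION name [organism]"; the batch size, the delay, and the function names parse_deflines and fetch_names are illustrative choices, not anything NCBI prescribes.

```python
import time
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_deflines(fasta_text):
    """Return {accession: name} from FASTA headers.

    Assumes headers of the form '>ACCESSION name [organism]'; a trailing
    bracketed organism, if present, is dropped.
    """
    names = {}
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        accession, _, description = line[1:].partition(" ")
        description = description.strip()
        if description.endswith("]") and " [" in description:
            description = description[: description.rindex(" [")]
        names[accession] = description
    return names


def fetch_names(accessions, batch_size=100, delay=0.4):
    """Fetch protein names for a list of accessions, one batch per request.

    batch_size and delay are assumptions; the delay keeps the request rate
    under the 3-per-second limit mentioned above.
    """
    names = {}
    for start in range(0, len(accessions), batch_size):
        batch = accessions[start:start + batch_size]
        query = urllib.parse.urlencode({
            "db": "protein",
            "id": ",".join(batch),
            "rettype": "fasta",
            "retmode": "text",
        })
        with urllib.request.urlopen(f"{EFETCH_URL}?{query}") as response:
            names.update(parse_deflines(response.read().decode("utf-8")))
        time.sleep(delay)
    return names
```

Calling fetch_names(["WP_051684486.1"]) would perform a live request against NCBI, so it is worth trying on a handful of accessions before running it over the whole file.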