>gi|54036156|sp|Q9DBC7.3|KAP0_MOUSE RecName: Full=cAMP-dependent protein kinase type I-alpha regulatory subunit
MASGSMATSEEERSLRECELYVQKHNIQALLKDSIVQLCTTRPERPMAFLREYFERLEKEEARQIQCLQK
TGIRTDSREDEISPPPPNPVVKGRRRRGAISAEVYTEEDAASYVRKVIPKDYKTMAALAKAIEKNVLFSH
LDDNERSDIFDAMFPVSFIAGETVIQQGDEGDNFYVIDQGEMDVYVNNEWATSVGEGGSFGELALIYGTP
RAATVKAKTNVKLWGIDRDSYRRILMGSTLRKRKMYEEFLSKVSILESLDKWERLTVADALEPVQFEDGQ
KIVVQGEPGDEFFIILEGTAAVLQRRSENEEFVEVGRLGPSDYFGEIALLMNRPRAATVVARGPLKCVKL
DRPRFERVLGPCSDILKRNIQQYNSFVSLSV
I bet there's a way to do this with e-utils, but I can't figure out how. I realize that this can be done using the Ensembl Biomart, but the ID conversion from Entrez to Ensembl gene IDs results in lots of duplicates, i.e. one to many and many to one mappings both ways.
... returns 37 lines with unique protein accession numbers. I can't see what's different about the line containing 54036156. A simple sort|uniq wouldn't work in this case.
I'm looking at your answer two years later and am wondering if I could get some help with your updated code (for finding protein IDs as opposed to protein FASTA sequences). I tried the code as typed, but I get the following output:
sort: multi-character tab ` '
sort: multi-character tab ` '
join: illegal tab character specification
cut: bad delimiter
cut: bad delimiter
with no results (protein_gi.txt is blank, as is ordered_geneids.txt). When I remove -t ' '
after sort, ordered_geneids.txt is generated appropriately, but I am at a loss as to how I could alter the curl line to get the code to function. Any help would be greatly appreciated.
ADD REPLY
• link
updated 2.2 years ago by
Ram
45k
•
written 9.9 years ago by
hmg
▴
20
It seems that the default genbank download does not contain that information because this is a related item and not the actual protein it encodes for. Perhaps there is a more detailed genbank or other format that does contain that information
Thanks Pierre. Do you know how this will scale to ~17,000 IDs?
No. I would download NCBI ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz , extract the protein acn and use NCBI batch entrez with those acns.
Did this.
returns 37 lines. For gene id 19084, the protein accession I'm looking for is 54036156. How do I know this is the right one out of those 37 lines?
Using this approach I get a lot of duplicates unfortunately. Biomart, with your toy example of 3 ids, gives no duplicates.
what is a "duplicate" ? same id or same sequence ? anyway, It's easy to remove the sequence/name duplicate with a simple sort | uniq
Many, many protein IDs for the same gene ID. E.g. searching for entrez gene 19804 ( http://www.ncbi.nlm.nih.gov/gene/?term=19084 ) returns protein accession number Q9DBC7.3 (genepept 54036156 - http://www.ncbi.nlm.nih.gov/protein/54036156 ). But executing the command above:
... returns 37 lines with unique protein accession numbers. I can't see what's different about the line containing 54036156. A simple sort|uniq wouldn't work in this case.
I meant sort| uniq after linearizing the fasta sequence to get the unique sequences. Or just use "sort -u" to only keep one record per gene-id.
Thanks Pierre. Hmm, looks like I'm exhausting my available 32GB memory at the fgrep step.
I was using it to go faster. You can remove it from the pipeline.
Hello,
I'm looking at your answer two years later and am wondering if I could get some help with your updated code (for finding protein IDs as opposed to protein FASTA sequences). I tried the code as typed, but I get the following output:
with no results (
protein_gi.txt
is blank, as isordered_geneids.txt
). When I remove-t ' '
after
sort
,ordered_geneids.txt
is generated appropriately, but I am at a loss as to how I could alter thecurl
line to get the code to function. Any help would be greatly appreciated.