How Can I Obtain "Product" Features From A List Of NCBI Accessions?
13.0 years ago
Luke ▴ 240

I have a table of SNVs. Each row is a different exonic SNV. Column 4 of my table contains NCBI accession numbers. How can I append the gene product information at the end of each row (i.e. as column 5)? Thank you! Luca

mutation retrieval genbank identifiers • 3.1k views

Specifically, which information do you wish to retrieve and add to column 5? Are you intending to parse info from GenBank format or another source?

Hi Larry! I wish to retrieve this information from GenBank:

[...]
FEATURES             Location/Qualifiers
[...]
     gene            1..3285
                     /gene="CWF19L2"
==>                  /note="CWF19-like 2, cell cycle control
                     /db_xref="GeneID:143884"
                     /db_xref="HGNC:26508"
                     /db_xref="HPRD:13102"
[...]
     CDS             31..2715
                     /gene="CWF19L2"
                     /codon_start=1
==>                  /product="CWF19-like protein 2"
[...]

For example those indicated by the arrows. Thank you!

Yes, Larry, from GenBank! For example /note="CWF19-like 2, cell cycle control" under the section "gene" or /product="CWF19-like protein 2" under the section "CDS" of the GenBank file.

Is this the format you want to add? http://www.ncbi.nlm.nih.gov/nuccore/124487290?report=fasta gives: >gi|124487290|ref|NM_001081077.1| Mus musculus CWF19-like 1, cell cycle control (S. pombe) (Cwf19l1), mRNA

I'm interested only in a brief description of the CDS product.

13.0 years ago

Say you have the following input:

A NM_001081077 A
B NM_001081078 B
C NM_001081079 C
D NM_001081080 D

and the following xslt stylesheet:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0"
    >

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:text> </xsl:text>
    <xsl:value-of select="/GBSet/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='product']/GBQualifier_value"/>
    <xsl:text>
</xsl:text>
  </xsl:template>
</xsl:stylesheet>

the command line would be:

$ while read -r L ; do ID=$(echo "$L" | cut -d ' ' -f 2); echo -n "$L"; curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&rettype=gb&retmode=xml" | xsltproc --novalid stylesheet.xsl - ; done < input.txt

result:

A NM_001081077 A CWF19-like protein 1
B NM_001081078 B lactase-phlorizin hydrolase preproprotein
C NM_001081079 C opioid growth factor receptor-like protein 1
D NM_001081080 D PHD finger protein 3
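
The loop above reads the accession from column 2 of a space-separated file, whereas the question describes a table with the accession in column 4. A minimal sketch of the same idea for an arbitrary column could look like the following. It assumes a tab-separated table, that `curl` and `xsltproc` are installed, and that `stylesheet.xsl` is the stylesheet shown above; `fetch_product`, `append_products` and `NCBI_DELAY` are hypothetical names introduced only for this illustration.

```shell
#!/usr/bin/env bash
# Sketch: append the product as a new last column of a tab-separated table.
# fetch_product, append_products and NCBI_DELAY are illustrative names,
# not part of any NCBI tool.

# Print the product reported for one accession, using stylesheet.xsl.
fetch_product() {
  curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$1&rettype=gb&retmode=xml" </dev/null |
    xsltproc --novalid stylesheet.xsl - |
    sed 's/^ *//'   # the stylesheet emits a leading space; drop it
}

# Usage: append_products COLUMN < table.tsv > table_with_product.tsv
append_products() {
  local col="$1" line acc
  while IFS= read -r line; do
    acc=$(printf '%s\n' "$line" | cut -f "$col")          # accession column
    printf '%s\t%s\n' "$line" "$(fetch_product "$acc")"   # row + product
    sleep "${NCBI_DELAY:-0.4}"   # stay under NCBI's 3 requests/second limit
  done
}
```

For the table in the question this would be called as `append_products 4 < snv_table.tsv`, where `snv_table.tsv` is a placeholder file name. Keeping the lookup in its own function makes it easy to swap in a different parser or a local cache.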
13.0 years ago
Rm 8.3k

You can get that information using eutils with curl, plus grep or awk to match the qualifier you are looking for.

The simplest will be:

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_001081077&rettype=gb&retmode=text" | grep "/note="

More precisely, if you are looking for the "note" within the "gene" feature:

curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_001081077&rettype=gb&retmode=text" | awk '/     gene/,/note=/' | grep "/note="

Putting it into Pierre's while loop:

while read -r L ; do ID=$(echo "$L" | cut -d ' ' -f 2); echo -n "$L"; curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&rettype=gb&retmode=text" | awk '/^     CDS/,/product=/' | grep "/product=" | sed 's/ *\/product="/ /; s/"$//' ;done < input.txt

Output:

A NM_001081077 A CWF19-like protein 1
B NM_001081078 B lactase-phlorizin hydrolase preproprotein
C NM_001081079 C opioid growth factor receptor-like protein 1
D NM_001081080 D PHD finger protein 3
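
The grep/sed pipeline above works when the `/product` value fits on one line. GenBank flat files wrap long qualifier values onto continuation lines, so a slightly more defensive sketch might join those lines and only look inside the CDS feature. The function name `cds_product` is hypothetical, and the code assumes the standard flat-file layout (feature keys starting in column 6, qualifiers indented further).

```shell
#!/usr/bin/env bash
# Sketch: print the /product value of the first CDS feature found in a
# GenBank flat-file record read from stdin. cds_product is an illustrative name.
cds_product() {
  awk '
    # A feature key (gene, CDS, ...) starts after exactly five spaces.
    /^     [^ ]/ { in_cds = ($1 == "CDS") }

    # First line of the product qualifier inside the CDS feature.
    in_cds && !collecting && /^ +\/product="/ {
      sub(/^ +\/product="/, "")
      value = ""; collecting = 1; first = 1
    }

    # Accumulate the value until the closing quote is seen.
    collecting {
      line = $0
      if (!first) sub(/^ +/, " ", line)   # continuation line: indent becomes one space
      first = 0
      value = value line
      if (line ~ /"$/) { sub(/"$/, "", value); print value; exit }
    }
  '
}
```

It can take the place of the `awk | grep | sed` stage in the loop, for example `curl -s "<efetch URL with rettype=gb&retmode=text>" | cds_product`, with a separating space or tab printed before the result.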

For the CDS product, use this: awk '/     CDS/,/product=/' | grep "/product="
