I have a table of SNVs. Each row is a different exonic SNV. Column 4 of my table contains a list of NCBI accession numbers. How can I append the gene product information at the end of each row (i.e. as column 5)? Thank you! Luca
Say you have the following input:
A NM_001081077 A
B NM_001081078 B
C NM_001081079 C
D NM_001081080 D
and the following xslt stylesheet:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0"
>
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:text> </xsl:text>
<xsl:value-of select="/GBSet/GBSeq/GBSeq_feature-table/GBFeature/GBFeature_quals/GBQualifier[GBQualifier_name='product']/GBQualifier_value"/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
the command line would be:
$ while read L ; do ID=`echo $L | cut -d ' ' -f 2`; echo -n $L; xsltproc --novalid stylesheet.xsl "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&rettype=gb&retmode=xml" ;done < input.txt
result:
A NM_001081077 A CWF19-like protein 1
B NM_001081078 B lactase-phlorizin hydrolase preproprotein
C NM_001081079 C opioid growth factor receptor-like protein 1
D NM_001081080 D PHD finger protein 3
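The one-liner above can also be written as a small reusable function, which makes it easier to change the accession column (column 4 in the question, column 2 in this example) or to swap the lookup step. This is only a sketch: `annotate_table` and `fetch_product` are hypothetical names, not part of any tool, and the lookup is assumed to print one description per accession.

```shell
# annotate_table COLUMN LOOKUP < table
# Appends the output of LOOKUP (a command or function taking one accession)
# to every non-empty line read from standard input.
annotate_table() {
    col=$1
    lookup=$2
    while IFS= read -r line || [ -n "$line" ]; do
        [ -n "$line" ] || continue
        # Pull the accession out of the requested whitespace-separated column.
        id=$(printf '%s\n' "$line" | awk -v c="$col" '{ print $c }')
        # </dev/null keeps the lookup from consuming the table on stdin;
        # sed drops any leading spaces the lookup may emit.
        desc=$("$lookup" "$id" </dev/null | sed 's/^ *//')
        printf '%s %s\n' "$line" "$desc"
    done
}

# Hypothetical usage, wrapping the xsltproc call shown above:
# fetch_product() {
#     xsltproc --novalid stylesheet.xsl \
#         "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$1&rettype=gb&retmode=xml"
# }
# annotate_table 2 fetch_product < input.txt
```

Like the original loop, this still makes one request per row, so it is best suited to small tables.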
You can get that information using eutils with curl, then grep or awk for the pattern you are looking for.
The simplest approach is:
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_001081077&rettype=gb" | grep "/note="
More precisely, if you are looking for the "note" within the "gene" feature:
curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_001081077&rettype=gb" | awk '/ gene/,/note=/' | grep "/note="
Putting this into Pierre's while loop, here extracting the "product" of the "CDS" feature:
while read L ; do ID=`echo $L | cut -d ' ' -f 2`; echo -n $L; curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&rettype=gb" | awk '/ CDS/,/product=/' | grep "/product=" | sed 's/ *\/product=\"/ / ; s/"$//' ;done < input.txt
Output:
A NM_001081077 A CWF19-like protein 1
B NM_001081078 B lactase-phlorizin hydrolase preproprotein
C NM_001081079 C opioid growth factor receptor-like protein 1
D NM_001081080 D PHD finger protein 3
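The awk/grep/sed chain can be folded into a single awk call that also copes with qualifier values wrapped over several lines. The following is a minimal sketch, not a full GenBank parser: `gb_qualifier` is a hypothetical name, and it assumes the usual flat-file layout in which feature keys start after five spaces and qualifiers are indented further.

```shell
# gb_qualifier FEATURE QUALIFIER < record.gb
# Prints the first value of /QUALIFIER="..." found inside the first
# matching FEATURE block of a GenBank flat-file record on standard input.
gb_qualifier() {
    awk -v feat="$1" -v qual="$2" '
        BEGIN { start = "^ +/" qual "=\"" }
        # Feature keys sit at column 6; remember whether we are in FEATURE.
        /^     [^ ]/ { inside = ($1 == feat) }
        inside && $0 ~ start {
            sub(start, "")
            val = $0
            # A value without a closing quote continues on the next line(s).
            while (val !~ /"$/ && (getline cont) > 0) {
                sub(/^ +/, "", cont)
                val = val " " cont
            }
            sub(/"$/, "", val)
            print val
            exit
        }
    '
}

# Usage with the same efetch URL as above:
# curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_001081077&rettype=gb" | gb_qualifier CDS product
# curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NM_001081077&rettype=gb" | gb_qualifier gene note
```

Only unquoted-free, double-quoted qualifier values are handled; values containing escaped quotes would need a more careful parser.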
Specifically, which information do you wish to retrieve and add to column 5? Are you intending to parse info from GenBank format or another source?
Hi Larry! I wish to retrieve this information from GenBank:
[...]
FEATURES             Location/Qualifiers
[...]
     gene            1..3285
                     /gene="CWF19L2"
==>                  /note="CWF19-like 2, cell cycle control
                     /db_xref="GeneID:143884"
                     /db_xref="HGNC:26508"
                     /db_xref="HPRD:13102"
[...]
     CDS             31..2715
                     /gene="CWF19L2"
                     /codon_start=1
==>                  /product="CWF19-like protein 2"
[...]
For example those indicated by the arrows. Thank you!
Yes, Larry, from GenBank! For example /note="CWF19-like 2, cell cycle control" under the section "gene" or /product="CWF19-like protein 2" under the section "CDS" of the GenBank file.
http://www.ncbi.nlm.nih.gov/nuccore/124487290?report=fasta Is this the format you want to add? >gi|124487290|ref|NM_001081077.1| Mus musculus CWF19-like 1, cell cycle control (S. pombe) (Cwf19l1), mRNA
I'm interested only in a brief description of the CDS product.