entrez utilities: snp
7.7 years ago
DanielC ▴ 210

Dear All,

I am trying to get the 3' and 5' UTRs of the BRCA1 and BRCA2 mRNAs. I learned that the Entrez utilities can do this, as follows:

source: https://www.ncbi.nlm.nih.gov/books/NBK179288/

and the code is:

ThreePrimeUTRs() {
    xtract -pattern INSDSeq -ACC INSDSeq_accession-version -SEQ INSDSeq_sequence \
      -group INSDFeature -if INSDFeature_key -equals CDS -PRD "(-)" \
        -block INSDQualifier -if INSDQualifier_name \
          -equals product -PRD INSDQualifier_value \
        -block INSDFeature -pfc "\n" -element "&ACC" -rst \
          -last INSDInterval_to -element "&SEQ" "&PRD" |
    while read acc pos seq prd
    do
      if [ $pos -lt ${#seq} ]
      then
        echo -e ">$acc 3'UTR: $((pos+1))..${#seq} $prd"
        echo "${seq:$pos}" | fold -w 50
      elif [ $pos -ge ${#seq} ]
      then
        echo -e ">$acc NO 3'UTR"
      fi
    done
  }

  esearch -db nuccore -query "3.6.4.12 [ECNO]" |
  efilter -molecule mrna -source refseq |
  efetch -format gbc | ThreePrimeUTRs

When I run this, I keep getting the following error:

Unrecognized argument '-if'
No -element before 'INSDFeature_key'
Unrecognized argument '-equals'
No -element before 'CDS'

Can someone please help me understand what is going wrong? Can I also get the 5' UTR with similar code? Finally, I would also like to get the SNPs in the 3' and 5' UTRs.

Thank you so much! DK

If you put the code in a file, make it executable and run it, it produces a result.

>XM_005708748.1 3'UTR: 1918..1941 ATP-dependent DNA helicase RecQ
gtgtggttttcaacaagttttaca
>XM_005707168.1 3'UTR: 3700..3734 ATP-dependent DNA helicase RecQ
gttgctttgggtttcacaaggtaaatttatgacaa
>XM_005706277.1 3'UTR: 1799..1836 ATP-dependent DNA helicase 2 subunit 1 isoform 1
gaacggccagtatacaacacccagatcagccaaatcaa
>XM_005706276.1 3'UTR: 1409..1481 ATP-dependent DNA helicase 2 subunit 1 isoform 2
tccgtcaaaatattcggatcctgatattcaacgatattataacggattac
aagctctggctctgaatcaaacc
>XM_005705233.1 3'UTR: 1632..1709 ATP-dependent DNA helicase RecQ
ctttattgtatgagaattttctgaatttctttgcagacatttctttcgca
tgtatcttataaacaactataagattgt
>NM_001278454.1 3'UTR: 6111..8976 chromodomain-helicase-DNA-binding protein 2
agcgactgagaaggggggggggaaacacgtcttgaaagacttggatgcaa
caaccagaaactctgaacatgctgctatcatcttgctgggtcaaggagga
ttttggaggagcaggtggaggaagactcagttctaatttgggttcccatt
ttgtttccccccctttctctcgttgaacattggaaccagacttgcctcgt
tctttttctttggtttgttttccccaatccaacggacacgtggagaattt
tcctcagccacagtgtttccccaaaaccgagaaggcggatcaatgctgct

truncated for brevity.

You will need to change your query (e.g. -query "BRCA") to get what you need.
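
On the 5' UTR part of the question: the same loop idea should carry over if the loop is given the start of the CDS instead of its end. Below is a minimal sketch of the shell side only, assuming each input line holds accession, CDS start (1-based), sequence and product. The name `five_prime_utrs` and the toy input line are made up for illustration, and the xtract arguments that would produce such lines are not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical counterpart to the while-loop in ThreePrimeUTRs.
# Input: whitespace-separated lines of
#   accession  CDS-start (1-based)  sequence  product
five_prime_utrs() {
  while read -r acc start seq prd
  do
    if [ "$start" -gt 1 ]
    then
      # Bases 1..start-1 lie upstream of the CDS
      echo ">$acc 5'UTR: 1..$((start-1)) $prd"
      echo "${seq:0:$((start-1))}" | fold -w 50
    else
      echo ">$acc NO 5'UTR"
    fi
  done
}

# Toy input, not real data: the CDS starts at base 6, so bases 1..5 are the UTR
printf 'TOY_0001.1\t6\tcttagatggccaaataag\ttoy protein\n' | five_prime_utrs
```

Records that have no CDS feature at all would need separate handling before they reach this loop.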

Thanks! But when I run it, I keep getting the error shown above:

Unrecognized argument '-if'
No -element before 'INSDFeature_key'
Unrecognized argument '-equals'
No -element before 'CDS'

Are you using the bash shell? If not, issue the command bash and then run the file at the new system prompt that should show up.
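
One reason the shell matters: the loop in the script uses `${seq:$pos}` and `echo -e`, which are bash features and are not guaranteed to work in other shells. A small guard like the following could go at the top of the file. It is only a sketch; `running_under_bash` and `yourscript.sh` are hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical guard: bash sets BASH_VERSION, other shells normally do not.
running_under_bash() {
  [ -n "${BASH_VERSION:-}" ]
}

if running_under_bash
then
  echo "bash $BASH_VERSION"
else
  echo "not bash: try 'bash yourscript.sh' instead" >&2
fi
```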

OK, Thanks. I will try and update here.

Isn't this question the same as "SNPs; entrez utilities"?

Not exactly. In that question I had no idea of the approach. Here I have found a way, but I am getting errors and am still missing a way to get the SNPs.

I have it on good authority that this should work fine if you update your installation of eutils.

7.7 years ago

Using XSLT, assuming the record is an mRNA in 5'->3' orientation with correct annotation. I'm extracting the leftmost CDS start and the rightmost CDS end:

<?xml version='1.0' encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.1'>
<xsl:output method="text" encoding="UTF-8"/>
<!-- pattern for root node -->
<xsl:template match="/">
<!-- when the XML root node is matched, search a pattern for the XML element(s) GBSet/GBSeq -->
<xsl:apply-templates select="/GBSet/GBSeq"/>
</xsl:template>
<!-- pattern for root GBSeq -->
<xsl:template match="GBSeq">
<!-- create a variable acn containing the Accession number -->
<xsl:variable name="acn" select="GBSeq_accession-version/text()"/>
<!--create a variable containing CDS start -->
<xsl:variable name="cdsstart">
<!-- search all GBInterval_from values of the Features named "CDS" -->
<xsl:for-each select="GBSeq_feature-table/GBFeature[GBFeature_key='CDS']/GBFeature_intervals/GBInterval/GBInterval_from/text()">
<!-- sort as a number, ascending -->
<xsl:sort data-type="number" order="ascending"/>
<!-- print the first value only (the smallest CDS start) -->
<xsl:if test="position()=1"><xsl:value-of select="."/></xsl:if>
</xsl:for-each>
</xsl:variable>
<!--create a variable containing CDS end -->
<xsl:variable name="cdsend">
<!-- search all GBInterval_to values of the Features named "CDS" -->
<xsl:for-each select="GBSeq_feature-table/GBFeature[GBFeature_key='CDS']/GBFeature_intervals/GBInterval/GBInterval_to/text()">
<!-- sort as a number, descending -->
<xsl:sort data-type="number" order="descending"/>
<!-- print the first value only (the greatest CDS end) -->
<xsl:if test="position()=1"><xsl:value-of select="."/></xsl:if>
</xsl:for-each>
</xsl:variable>
<!-- print UTR 5-->
<xsl:text>&gt;</xsl:text>
<xsl:value-of select="concat($acn,'|-',$cdsstart)"/><xsl:text>|5' UTR
</xsl:text>
<xsl:value-of select="substring(GBSeq_sequence/text(),1,number($cdsstart) - 1 ) "/>
<!-- print UTR 3-->
<xsl:text>
&gt;</xsl:text>
<xsl:value-of select="concat($acn,'|',$cdsend)"/><xsl:text>-|3' UTR
</xsl:text>
<xsl:value-of select="substring(GBSeq_sequence/text(),number($cdsend) + 1 ) "/>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
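
For readers who do not know XSLT, the logic the stylesheet is meant to implement is short: take the smallest CDS start and the largest CDS end, then cut the sequence before and after them. Here is a rough bash sketch of that logic on a toy record. `cds_bounds` and `split_utrs` are hypothetical helpers, not part of the stylesheet, and the intervals and sequence are invented.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads "from to" CDS interval pairs (1-based) on stdin
# and prints the smallest start and the largest end.
cds_bounds() {
  awk 'NR == 1 { lo = $1; hi = $2 }
       { if ($1 < lo) lo = $1; if ($2 > hi) hi = $2 }
       END { if (NR) print lo, hi }'
}

# Hypothetical helper: prints the 5' UTR on line 1 and the 3' UTR on line 2.
split_utrs() {
  local seq=$1 start=$2 end=$3
  printf '%s\n' "${seq:0:$((start-1))}"   # bases before the CDS start
  printf '%s\n' "${seq:$end}"             # bases after the CDS end
}

# Toy record, not real data: CDS intervals 4..6 and 7..12 on an 18-base sequence
set -- $(printf '4 6\n7 12\n' | cds_bounds)
split_utrs "cttatggccaaataagtg" "$1" "$2"   # prints ctt, then taagtg
```

A record with no CDS gives empty bounds, which matches the empty NR_027676.1 entries in the output.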
$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_007299.3&id=NR_027676.1&id=NM_007299.3&id=NM_007298.3&retmode=xml" | xsltproc --novalid transform.xsl - | fold -w 60
>NM_007299.3|-195|5' UTR
cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
tggaacagaaagaa
>NM_007299.3|2294-|3' UTR
ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc
cttggaaatctgccatgagcacaaaattatggtaatttttcacctgagaagattttaaaa
ccatttaaacgccaccaattgagcaagatgctgattcattatttatcagccctattcttt
ctattcaggctgttgttggcttagggctggaagcacagagtggcttggcctcaagagaat
agctggtttccctaagtttacttctctaaaaccctgtgttcacaaaggcagagagtcaga
cccttcaatggaaggagagtgcttgggatcgattatgtgacttaaagtcagaatagtcct
tgggcagttctcaaatgttggagtggaacattggggaggaaattctgaggcaggtattag
aaatgaaaaggaaacttgaaacctgggcatggtggctcacgcctgtaatcccagcacttt
gggaggccaaggtgggcagatcactggaggtcaggagttcgaaaccagcctggccaacat
ggtgaaaccccatctctactaaaaatacagaaattagccggtcatggtggtggacacctg
taatcccagctactcaggtggctaaggcaggagaatcacttcagcccgggaggtggaggt
tgcagtgagccaagatcataccacggcactccagcctgggtgacagtgagactgtggctc
aaaaaaaaaaaaaaaaaaaggaaaatgaaactagaagagatttctaaaagtctgagatat
atttgctagatttctaaagaatgtgttctaaaacagcagaagattttcaagaaccggttt
ccaaagacagtcttctaattcctcattagtaataagtaaaatgtttattgttgtagctct
ggtatataatccattcctcttaaaatataagacctctggcatgaatatttcatatctata
aaatgacagatcccaccaggaaggaagctgttgctttctttgaggtgatttttttccttt
gctccctgttgctgaaaccatacagcttcataaataattttgcttgctgaaggaagaaaa
agtgtttttcataaacccattatccaggactgtttatagctgttggaaggactaggtctt
ccctagcccccccagtgtgcaagggcagtgaagacttgattgtacaaaatacgttttgta
aatgttgtgctgttaacactgcaaataaacttggtagcaaacacttccaaaaaaaaaaaa
aaaaaa
>NR_027676.1|-|5' UTR
>NR_027676.1|-|3' UTR
>NM_007299.3|-195|5' UTR
cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
tggaacagaaagaa
>NM_007299.3|2294-|3' UTR
ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc
cttggaaatctgccatgagcacaaaattatggtaatttttcacctgagaagattttaaaa
ccatttaaacgccaccaattgagcaagatgctgattcattatttatcagccctattcttt
ctattcaggctgttgttggcttagggctggaagcacagagtggcttggcctcaagagaat
agctggtttccctaagtttacttctctaaaaccctgtgttcacaaaggcagagagtcaga
cccttcaatggaaggagagtgcttgggatcgattatgtgacttaaagtcagaatagtcct
tgggcagttctcaaatgttggagtggaacattggggaggaaattctgaggcaggtattag
aaatgaaaaggaaacttgaaacctgggcatggtggctcacgcctgtaatcccagcacttt
gggaggccaaggtgggcagatcactggaggtcaggagttcgaaaccagcctggccaacat
ggtgaaaccccatctctactaaaaatacagaaattagccggtcatggtggtggacacctg
taatcccagctactcaggtggctaaggcaggagaatcacttcagcccgggaggtggaggt
tgcagtgagccaagatcataccacggcactccagcctgggtgacagtgagactgtggctc
aaaaaaaaaaaaaaaaaaaggaaaatgaaactagaagagatttctaaaagtctgagatat
atttgctagatttctaaagaatgtgttctaaaacagcagaagattttcaagaaccggttt
ccaaagacagtcttctaattcctcattagtaataagtaaaatgtttattgttgtagctct
ggtatataatccattcctcttaaaatataagacctctggcatgaatatttcatatctata
aaatgacagatcccaccaggaaggaagctgttgctttctttgaggtgatttttttccttt
gctccctgttgctgaaaccatacagcttcataaataattttgcttgctgaaggaagaaaa
agtgtttttcataaacccattatccaggactgtttatagctgttggaaggactaggtctt
ccctagcccccccagtgtgcaagggcagtgaagacttgattgtacaaaatacgttttgta
aatgttgtgctgttaacactgcaaataaacttggtagcaaacacttccaaaaaaaaaaaa
aaaaaa
>NM_007298.3|-20|5' UTR
ttcattggaacagaaagaa
>NM_007298.3|2299-|3' UTR
ctgcagccagccacaggtacagagccacaggaccccaagaatgagcttacaaagtggcct
ttccaggccctgggagctcctctcactcttcagtccttctactgtcctggctactaaata
ttttatgtacatcagcctgaaaaggacttctggctatgcaagggtcccttaaagattttc
tgcttgaagtctcccttggaaatctgccatgagcacaaaattatggtaatttttcacctg
agaagattttaaaaccatttaaacgccaccaattgagcaagatgctgattcattatttat
cagccctattctttctattcaggctgttgttggcttagggctggaagcacagagtggctt
ggcctcaagagaatagctggtttccctaagtttacttctctaaaaccctgtgttcacaaa
ggcagagagtcagacccttcaatggaaggagagtgcttgggatcgattatgtgacttaaa
gtcagaatagtccttgggcagttctcaaatgttggagtggaacattggggaggaaattct
gaggcaggtattagaaatgaaaaggaaacttgaaacctgggcatggtggctcacgcctgt
aatcccagcactttgggaggccaaggtgggcagatcactggaggtcaggagttcgaaacc
agcctggccaacatggtgaaaccccatctctactaaaaatacagaaattagccggtcatg
gtggtggacacctgtaatcccagctactcaggtggctaaggcaggagaatcacttcagcc
cgggaggtggaggttgcagtgagccaagatcataccacggcactccagcctgggtgacag
tgagactgtggctcaaaaaaaaaaaaaaaaaaaggaaaatgaaactagaagagatttcta
aaagtctgagatatatttgctagatttctaaagaatgtgttctaaaacagcagaagattt
tcaagaaccggtttccaaagacagtcttctaattcctcattagtaataagtaaaatgttt
attgttgtagctctggtatataatccattcctcttaaaatataagacctctggcatgaat
atttcatatctataaaatgacagatcccaccaggaaggaagctgttgctttctttgaggt
gatttttttcctttgctccctgttgctgaaaccatacagcttcataaataattttgcttg
ctgaaggaagaaaaagtgtttttcataaacccattatccaggactgtttatagctgttgg
aaggactaggtcttccctagcccccccagtgtgcaagggcagtgaagacttgattgtaca
aaatacgttttgtaaatgttgtgctgttaacactgcaaataaacttggtagcaaacactt
ccaaaaaaaaaaaaaaaaaa

Hi Pierre, thank you so much! XSLT seems like a promising tool. Please help me with a few questions:

a) "using XSLT , assuming it's mRNA, 5'->3', with the correct annotation. I'm extracting the position of the left and right CDS:"

I had never actually thought about whether the mRNA is 5' -> 3' or the reverse. I assumed the records from which the UTRs are extracted were well annotated with all the necessary information. Do I need to be careful when extracting such information from Entrez?

b) I ran your script and it worked like a charm. Could you briefly explain what the script "transform.xsl" is doing? That would help me get a clear idea.

c) Please let me know how you got the RefSeq IDs used in this command:

$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_007299.3&id=NR_027676.1&id=NM_007299.3&id=NM_007298.3&retmode=xml" | xsltproc --novalid transform.xsl - | fold -w 60

d) Now that I have the 5' and 3' UTRs, I need to extract the SNPs and their positions. Could you please share how this could be done using XSLT?

Thanks much!

a)

I thought the file from which the UTRs are to be extracted are well annotated

they're not: there is no CDS in NR_027676.1

b) I'm going to add some comments to the stylesheet; please refresh the page in a few minutes.

c) how you got the refseqids

I just picked a few random mRNA accessions by searching Entrez for "mRNA BRCA1".

d) no, because that's not your original question. Ask a new question.

Ok, thanks for the explanation. For the SNPs in 3' and 5' UTRs and their positions, I have asked a new question, please share the solution. Thank you :-)

SNPs in UTR
