entrez utilities: snp

7.7 years ago

DanielC ▴ 210

Dear All,

I am trying to get the 3' and 5' UTRs of the BRCA1 and BRCA2 mRNAs. I learned that this can be done with Entrez utilities, like this:

source: https://www.ncbi.nlm.nih.gov/books/NBK179288/

and the code is:

ThreePrimeUTRs() {
    xtract -pattern INSDSeq -ACC INSDSeq_accession-version -SEQ INSDSeq_sequence \
      -group INSDFeature -if INSDFeature_key -equals CDS -PRD "(-)" \
        -block INSDQualifier -if INSDQualifier_name \
          -equals product -PRD INSDQualifier_value \
        -block INSDFeature -pfc "\n" -element "&ACC" -rst \
          -last INSDInterval_to -element "&SEQ" "&PRD" |
    while read acc pos seq prd
    do
      if [ $pos -lt ${#seq} ]
      then
        echo -e ">$acc 3'UTR: $((pos+1))..${#seq} $prd"
        echo "${seq:$pos}" | fold -w 50
      elif [ $pos -ge ${#seq} ]
      then
        echo -e ">$acc NO 3'UTR"
      fi
    done
}

esearch -db nuccore -query "3.6.4.12 [ECNO]" |
efilter -molecule mrna -source refseq |
efetch -format gbc | ThreePrimeUTRs

When I run this, I keep getting these errors:

**Unrecognized argument '-if'
No -element before 'INSDFeature_key'
Unrecognized argument '-equals'
No -element before 'CDS'**

Can someone please help me understand what is going wrong? Can I get the 5' UTR with the same code? And finally, I would also like to get the SNPs in the 3' and 5' UTRs.

Thank you so much! DK

SNP • 2.9k views

ADD COMMENT • link updated 7.7 years ago by Pierre Lindenbaum 166k • written 7.7 years ago by DanielC ▴ 210

If you put the code in a file, make it executable and run it, it produces a result.

>XM_005708748.1 3'UTR: 1918..1941 ATP-dependent DNA helicase RecQ
gtgtggttttcaacaagttttaca
>XM_005707168.1 3'UTR: 3700..3734 ATP-dependent DNA helicase RecQ
gttgctttgggtttcacaaggtaaatttatgacaa
>XM_005706277.1 3'UTR: 1799..1836 ATP-dependent DNA helicase 2 subunit 1 isoform 1
gaacggccagtatacaacacccagatcagccaaatcaa
>XM_005706276.1 3'UTR: 1409..1481 ATP-dependent DNA helicase 2 subunit 1 isoform 2
tccgtcaaaatattcggatcctgatattcaacgatattataacggattac
aagctctggctctgaatcaaacc
>XM_005705233.1 3'UTR: 1632..1709 ATP-dependent DNA helicase RecQ
ctttattgtatgagaattttctgaatttctttgcagacatttctttcgca
tgtatcttataaacaactataagattgt
>NM_001278454.1 3'UTR: 6111..8976 chromodomain-helicase-DNA-binding protein 2
agcgactgagaaggggggggggaaacacgtcttgaaagacttggatgcaa
caaccagaaactctgaacatgctgctatcatcttgctgggtcaaggagga
ttttggaggagcaggtggaggaagactcagttctaatttgggttcccatt
ttgtttccccccctttctctcgttgaacattggaaccagacttgcctcgt
tctttttctttggtttgttttccccaatccaacggacacgtggagaattt
tcctcagccacagtgtttccccaaaaccgagaaggcggatcaatgctgct

truncated for brevity.

You will need to change your query (e.g. -query "BRCA") to get what you need.

ADD REPLY • link 7.7 years ago by GenoMax 151k

Thanks! But when I run it, I keep getting the error above:

Unrecognized argument '-if'
No -element before 'INSDFeature_key'
Unrecognized argument '-equals'
No -element before 'CDS'

ADD REPLY • link 7.7 years ago by DanielC ▴ 210

Are you using the bash shell? If not, issue the command bash and then run the file at the new system prompt that should show up.

ADD REPLY • link 7.7 years ago by GenoMax 151k

OK, Thanks. I will try and update here.

ADD REPLY • link 7.7 years ago by DanielC ▴ 210

Isn't this question the same as "SNPs; entrez utilities"?

ADD REPLY • link 7.7 years ago by Sej Modha 5.3k

Not exactly, because in that question I had no idea of the approach. Here I have found a way, but I am getting errors and I am still missing a way to get the SNPs.

ADD REPLY • link 7.7 years ago by DanielC ▴ 210

I have it on good information that if you update your implementation of eutils, this should work fine.
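
Before updating, it can help to see which copies of the tools the shell actually picks up. A minimal sketch, where `report_tool` is a hypothetical helper name and not part of EDirect:

```shell
# Hypothetical helper (not part of EDirect): report where a command
# resolves on PATH, or say that it is missing.
report_tool() {
  if tool_path=$(command -v "$1" 2>/dev/null); then
    echo "$1: $tool_path"
  else
    echo "$1: not found on PATH"
  fi
}

# The four commands used in the pipeline from the question.
for tool in esearch efilter efetch xtract; do
  report_tool "$tool"
done
```

If `xtract` resolves to an old install, that would be consistent with it rejecting `-if` and `-equals`, and updating as suggested above is the likely fix.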

ADD REPLY • link 7.7 years ago by DCGenomics ▴ 330

7.7 years ago

Pierre Lindenbaum 166k

Using XSLT, assuming the record is an mRNA, oriented 5'->3', with correct annotation. I'm extracting the positions of the left and right ends of the CDS:

	<?xml version='1.0' encoding="UTF-8"?>
	<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.1'>
	<xsl:output method="text" encoding="UTF-8"/>

	<!-- pattern for root node -->
	<xsl:template match="/">
	<!-- when the XML root node is matched, search a pattern for the XML element(s) GBSet/GBSeq -->
	<xsl:apply-templates select="/GBSet/GBSeq"/>
	</xsl:template>

	<!-- pattern for root GBSeq -->
	<xsl:template match="GBSeq">
	<!-- create a variable acn containing the Accession number -->
	<xsl:variable name="acn" select="GBSeq_accession-version/text()"/>

	<!--create a variable containing CDS start -->

	<xsl:variable name="cdsstart">
	<!-- search all GBInterval_from values of the Features named "CDS" -->
	<xsl:for-each select="GBSeq_feature-table/GBFeature[GBFeature_key='CDS']/GBFeature_intervals/GBInterval/GBInterval_from/text()">
	<!-- sort as a number, ascending, so the smallest value comes first -->
	<xsl:sort data-type="number" order="ascending"/>
	<!-- print the first value only (the smallest CDS start) -->
	<xsl:if test="position()=1"><xsl:value-of select="."/></xsl:if>
	</xsl:for-each>
	</xsl:variable>

	<!--create a variable containing CDS end -->

	<xsl:variable name="cdsend">
	<!-- search all GBInterval_to values of the Features named "CDS" -->
	<xsl:for-each select="GBSeq_feature-table/GBFeature[GBFeature_key='CDS']/GBFeature_intervals/GBInterval/GBInterval_to/text()">
	<!-- sort as a number, descending, so the greatest value comes first -->
	<xsl:sort data-type="number" order="descending"/>
	<!-- print the first value only (the greatest CDS end) -->
	<xsl:if test="position()=1"><xsl:value-of select="."/></xsl:if>
	</xsl:for-each>
	</xsl:variable>

	<!-- print UTR 5-->
	<xsl:text>></xsl:text>
	<xsl:value-of select="concat($acn,'|-',$cdsstart)"/><xsl:text>|5' UTR
	</xsl:text>
	<xsl:value-of select="substring(GBSeq_sequence/text(),1,number($cdsstart) - 1 ) "/>

	<!-- print UTR 3-->
	<xsl:text>
	></xsl:text>
	<xsl:value-of select="concat($acn,'|',$cdsend)"/><xsl:text>-|3' UTR
	</xsl:text>
	<xsl:value-of select="substring(GBSeq_sequence/text(),number($cdsend) + 1 ) "/>
	<xsl:text>
	</xsl:text>

	</xsl:template>

	</xsl:stylesheet>
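
In plain terms, the stylesheet finds where the CDS starts and ends, then prints whatever lies before the start as the 5' UTR and whatever lies after the end as the 3' UTR. A minimal sketch of that slicing step alone, in shell with awk; `print_utrs` and the toy record are hypothetical, and positions are assumed to be 1-based and inclusive, as in the GenBank XML:

```shell
# Hypothetical helper: print the two UTRs of one mRNA in the same
# FASTA-like layout as the stylesheet output.
# Arguments: accession, sequence, CDS start, CDS end (1-based, inclusive).
print_utrs() {
  awk -v acn="$1" -v seq="$2" -v cds_start="$3" -v cds_end="$4" 'BEGIN {
    # Upstream of the CDS: bases 1 .. cds_start-1 (\047 is a single quote).
    printf ">%s|-%d|5\047 UTR\n%s\n", acn, cds_start, substr(seq, 1, cds_start - 1)
    # Downstream of the CDS: bases cds_end+1 .. end of sequence.
    printf ">%s|%d-|3\047 UTR\n%s\n", acn, cds_end, substr(seq, cds_end + 1)
  }'
}

# Toy record: 3 bases upstream, a 9-base CDS at 4..12, 3 bases downstream.
print_utrs TOY_0001 aaaATGCCCTAAggg 4 12
```

When a record has no CDS, there is no start or end to slice on, which is why such records come out with empty UTRs.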

	$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_007299.3&id=NR_027676.1&id=NM_007299.3&id=NM_007298.3&retmode=xml" | xsltproc --novalid transform.xsl - | fold -w 60

	>NM_007299.3|-195|5' UTR
	cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
	aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
	ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
	tggaacagaaagaa
	>NM_007299.3|2294-|3' UTR
	ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
	gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
	aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
	agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
	gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc
	cttggaaatctgccatgagcacaaaattatggtaatttttcacctgagaagattttaaaa
	ccatttaaacgccaccaattgagcaagatgctgattcattatttatcagccctattcttt
	ctattcaggctgttgttggcttagggctggaagcacagagtggcttggcctcaagagaat
	agctggtttccctaagtttacttctctaaaaccctgtgttcacaaaggcagagagtcaga
	cccttcaatggaaggagagtgcttgggatcgattatgtgacttaaagtcagaatagtcct
	tgggcagttctcaaatgttggagtggaacattggggaggaaattctgaggcaggtattag
	aaatgaaaaggaaacttgaaacctgggcatggtggctcacgcctgtaatcccagcacttt
	gggaggccaaggtgggcagatcactggaggtcaggagttcgaaaccagcctggccaacat
	ggtgaaaccccatctctactaaaaatacagaaattagccggtcatggtggtggacacctg
	taatcccagctactcaggtggctaaggcaggagaatcacttcagcccgggaggtggaggt
	tgcagtgagccaagatcataccacggcactccagcctgggtgacagtgagactgtggctc
	aaaaaaaaaaaaaaaaaaaggaaaatgaaactagaagagatttctaaaagtctgagatat
	atttgctagatttctaaagaatgtgttctaaaacagcagaagattttcaagaaccggttt
	ccaaagacagtcttctaattcctcattagtaataagtaaaatgtttattgttgtagctct
	ggtatataatccattcctcttaaaatataagacctctggcatgaatatttcatatctata
	aaatgacagatcccaccaggaaggaagctgttgctttctttgaggtgatttttttccttt
	gctccctgttgctgaaaccatacagcttcataaataattttgcttgctgaaggaagaaaa
	agtgtttttcataaacccattatccaggactgtttatagctgttggaaggactaggtctt
	ccctagcccccccagtgtgcaagggcagtgaagacttgattgtacaaaatacgttttgta
	aatgttgtgctgttaacactgcaaataaacttggtagcaaacacttccaaaaaaaaaaaa
	aaaaaa
	>NR_027676.1|-|5' UTR

	>NR_027676.1|-|3' UTR

	>NM_007299.3|-195|5' UTR
	cttagcggtagccccttggtttccgtggcaacggaaaagcgcgggaattacagataaatt
	aaaactgcgactgcgcggcgtgagctcgctgagacttcctggacgggggacaggctgtgg
	ggtttctcagataactgggcccctgcgctcaggaggccttcaccctctgctctggttcat
	tggaacagaaagaa
	>NM_007299.3|2294-|3' UTR
	ggcacctgtggtgacccgagagtgggtgttggacagtgtagcactctaccagtgccagga
	gctggacacctacctgataccccagatcccccacagccactactgactgcagccagccac
	aggtacagagccacaggaccccaagaatgagcttacaaagtggcctttccaggccctggg
	agctcctctcactcttcagtccttctactgtcctggctactaaatattttatgtacatca
	gcctgaaaaggacttctggctatgcaagggtcccttaaagattttctgcttgaagtctcc
	cttggaaatctgccatgagcacaaaattatggtaatttttcacctgagaagattttaaaa
	ccatttaaacgccaccaattgagcaagatgctgattcattatttatcagccctattcttt
	ctattcaggctgttgttggcttagggctggaagcacagagtggcttggcctcaagagaat
	agctggtttccctaagtttacttctctaaaaccctgtgttcacaaaggcagagagtcaga
	cccttcaatggaaggagagtgcttgggatcgattatgtgacttaaagtcagaatagtcct
	tgggcagttctcaaatgttggagtggaacattggggaggaaattctgaggcaggtattag
	aaatgaaaaggaaacttgaaacctgggcatggtggctcacgcctgtaatcccagcacttt
	gggaggccaaggtgggcagatcactggaggtcaggagttcgaaaccagcctggccaacat
	ggtgaaaccccatctctactaaaaatacagaaattagccggtcatggtggtggacacctg
	taatcccagctactcaggtggctaaggcaggagaatcacttcagcccgggaggtggaggt
	tgcagtgagccaagatcataccacggcactccagcctgggtgacagtgagactgtggctc
	aaaaaaaaaaaaaaaaaaaggaaaatgaaactagaagagatttctaaaagtctgagatat
	atttgctagatttctaaagaatgtgttctaaaacagcagaagattttcaagaaccggttt
	ccaaagacagtcttctaattcctcattagtaataagtaaaatgtttattgttgtagctct
	ggtatataatccattcctcttaaaatataagacctctggcatgaatatttcatatctata
	aaatgacagatcccaccaggaaggaagctgttgctttctttgaggtgatttttttccttt
	gctccctgttgctgaaaccatacagcttcataaataattttgcttgctgaaggaagaaaa
	agtgtttttcataaacccattatccaggactgtttatagctgttggaaggactaggtctt
	ccctagcccccccagtgtgcaagggcagtgaagacttgattgtacaaaatacgttttgta
	aatgttgtgctgttaacactgcaaataaacttggtagcaaacacttccaaaaaaaaaaaa
	aaaaaa
	>NM_007298.3|-20|5' UTR
	ttcattggaacagaaagaa
	>NM_007298.3|2299-|3' UTR
	ctgcagccagccacaggtacagagccacaggaccccaagaatgagcttacaaagtggcct
	ttccaggccctgggagctcctctcactcttcagtccttctactgtcctggctactaaata
	ttttatgtacatcagcctgaaaaggacttctggctatgcaagggtcccttaaagattttc
	tgcttgaagtctcccttggaaatctgccatgagcacaaaattatggtaatttttcacctg
	agaagattttaaaaccatttaaacgccaccaattgagcaagatgctgattcattatttat
	cagccctattctttctattcaggctgttgttggcttagggctggaagcacagagtggctt
	ggcctcaagagaatagctggtttccctaagtttacttctctaaaaccctgtgttcacaaa
	ggcagagagtcagacccttcaatggaaggagagtgcttgggatcgattatgtgacttaaa
	gtcagaatagtccttgggcagttctcaaatgttggagtggaacattggggaggaaattct
	gaggcaggtattagaaatgaaaaggaaacttgaaacctgggcatggtggctcacgcctgt
	aatcccagcactttgggaggccaaggtgggcagatcactggaggtcaggagttcgaaacc
	agcctggccaacatggtgaaaccccatctctactaaaaatacagaaattagccggtcatg
	gtggtggacacctgtaatcccagctactcaggtggctaaggcaggagaatcacttcagcc
	cgggaggtggaggttgcagtgagccaagatcataccacggcactccagcctgggtgacag
	tgagactgtggctcaaaaaaaaaaaaaaaaaaaggaaaatgaaactagaagagatttcta
	aaagtctgagatatatttgctagatttctaaagaatgtgttctaaaacagcagaagattt
	tcaagaaccggtttccaaagacagtcttctaattcctcattagtaataagtaaaatgttt
	attgttgtagctctggtatataatccattcctcttaaaatataagacctctggcatgaat
	atttcatatctataaaatgacagatcccaccaggaaggaagctgttgctttctttgaggt
	gatttttttcctttgctccctgttgctgaaaccatacagcttcataaataattttgcttg
	ctgaaggaagaaaaagtgtttttcataaacccattatccaggactgtttatagctgttgg
	aaggactaggtcttccctagcccccccagtgtgcaagggcagtgaagacttgattgtaca
	aaatacgttttgtaaatgttgtgctgttaacactgcaaataaacttggtagcaaacactt
	ccaaaaaaaaaaaaaaaaaa

ADD COMMENT • link 7.7 years ago by Pierre Lindenbaum 166k

Hi Pierre, thank you so much! This XSLT seems like a promising tool. Please help me with a few questions:

a) "using XSLT , assuming it's mRNA, 5'->3', with the correct annotation. I'm extracting the position of the left and right CDS:"

I have never actually thought about whether the mRNA is 5' -> 3' or vice versa. I thought the files from which the UTRs are extracted were well annotated with all the necessary information. Do I need to be careful when extracting such information from Entrez?

b) I ran your script and it worked like a charm. However, could you please briefly explain what the script "transform.xsl" is doing? It would help me get a clear idea.

c) Please let me know how you got the RefSeq IDs for this search:

$ wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_007299.3&id=NR_027676.1&id=NM_007299.3&id=NM_007298.3&retmode=xml" | xsltproc --novalid transform.xsl - | fold -w 60

d) After I have the 5' and 3' UTRs, I need to extract the SNPs and their positions. Could you please share how this could be done using XSLT?

Thanks much!

ADD REPLY • link 7.7 years ago by DanielC ▴ 210

a) "I thought the file from which the UTRs are to be extracted are well annotated"

They're not: there is no CDS in NR_027676.1.

b) I'm going to add some comments; please refresh in a few minutes.

c) "how you got the refseqids"

I just picked a few random mRNA accessions using the Entrez query "mRNA BRCA1".

d) No, because that's not your original question. Ask a new question.
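
On point a), the empty NR_027676.1 entries follow from the accession type: in RefSeq, NM_ and XM_ accessions are curated and predicted mRNAs, while NR_ and XR_ accessions are non-coding RNAs and normally carry no CDS feature. A minimal sketch that filters an ID list on that prefix before fetching; `is_coding_refseq` is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only for RefSeq mRNA accessions
# (NM_ = curated mRNA, XM_ = predicted mRNA).
is_coding_refseq() {
  case "$1" in
    NM_*|XM_*) return 0 ;;
    *)         return 1 ;;
  esac
}

# Keep only the accessions for which a CDS, and hence UTRs, can be expected.
for acc in NM_007299.3 NR_027676.1 NM_007298.3; do
  if is_coding_refseq "$acc"; then
    echo "$acc"
  fi
done
```

The prefix check is only a first filter; whether a given record is annotated with a CDS still has to be read from the record itself.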

ADD REPLY • link 7.7 years ago by Pierre Lindenbaum 166k

OK, thanks for the explanation. For the SNPs in the 3' and 5' UTRs and their positions, I have asked a new question; please share the solution there. Thank you :-)

SNPs in UTR

ADD REPLY • link 7.7 years ago by DanielC ▴ 210