Question

Retrieving features from a GenBank file

8.2 years ago

muhammad.ali • 0

I have downloaded a GenBank file from NCBI that contains multiple sequences. I want to convert this file into a table (data.frame) with column headings such as LOCUS, ACCESSION, FEATURES, etc. Can somebody recommend a solution?

R sequence • 4.0k views

updated 8.2 years ago by Pierre Lindenbaum 166k • written 8.2 years ago by muhammad.ali • 0

Can this tutorial help you?

8.2 years ago by e.rempel ★ 1.1k

This was a very useful tutorial, but I'm more interested in metadata, e.g. isolation source, location, latitude, longitude, country, date, etc.

8.2 years ago by muhammad.ali • 0

I have written this Python script, which creates a .csv file that you can open in R to create a data frame: https://github.com/dewshr/NCBI-Genbank-file-parser

7.7 years ago by dewshrs • 0

score 2 · Answer 1 · 2017-04-24

To create a tab-delimited table, use XSLT on the XML version of the records.

e.g.:


<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.1' >
<xsl:output method="text" encoding="UTF-8"/>


<xsl:template match="/">

<xsl:apply-templates select="GBSet"/>
</xsl:template>


<xsl:template match="GBSet">
<xsl:apply-templates select="GBSeq"/>
</xsl:template>

<xsl:template match="GBSeq">
<xsl:for-each select="GBSeq_feature-table/GBFeature">
<xsl:value-of select="../../GBSeq_locus"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="../../GBSeq_primary-accession"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="GBFeature_key"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="GBFeature_location"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

e.g.:

$ curl -s "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_001664.2,AE001273&retmode=xml" | xsltproc --novalid transform.xsl -

NC_001664   NC_001664   source          1..159322
NC_001664   NC_001664   repeat_region   1..8088
NC_001664   NC_001664   repeat_region   56..342
NC_001664   NC_001664   gene            501..6850
NC_001664   NC_001664   CDS             join(501..759,843..2653)
NC_001664   NC_001664   gene            4725..6850
NC_001664   NC_001664   CDS             join(4725..5028,5837..6720)
NC_001664   NC_001664   regulatory      6845..6850
NC_001664   NC_001664   repeat_region   7655..8008
NC_001664   NC_001664   misc_feature    8009..151234
(...)
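
For the metadata asked about in the comments (isolation source, country, lat/long, date), the same GBSet XML can also be read with Python's standard library. This is only a rough sketch, not a tested pipeline: the function names `source_rows` and `to_tsv` and the `WANTED` list are made up for illustration, and it assumes the metadata sits in the qualifiers of each record's `source` feature under the names listed. Records that use a different qualifier name (for example `geo_loc_name` instead of `country`) will simply give empty cells until you add that name to the list.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed qualifier names; adjust to what your records actually contain.
WANTED = ["organism", "isolation_source", "country", "lat_lon", "collection_date"]


def source_rows(xml_text, wanted=WANTED):
    """Yield one dict per 'source' feature found in a GBSet XML document."""
    root = ET.fromstring(xml_text)
    for seq in root.iter("GBSeq"):
        base = {
            "locus": seq.findtext("GBSeq_locus", default=""),
            "accession": seq.findtext("GBSeq_primary-accession", default=""),
        }
        for feat in seq.iterfind("GBSeq_feature-table/GBFeature"):
            # Only the 'source' feature is kept; other features are skipped.
            if feat.findtext("GBFeature_key") != "source":
                continue
            # Map qualifier name -> value (empty string if there is no value).
            quals = {
                q.findtext("GBQualifier_name"): q.findtext("GBQualifier_value", default="")
                for q in feat.iterfind("GBFeature_quals/GBQualifier")
            }
            row = dict(base)
            row["location"] = feat.findtext("GBFeature_location", default="")
            for name in wanted:
                row[name] = quals.get(name, "")
            yield row


def to_tsv(rows, wanted=WANTED):
    """Return the rows as tab-delimited text with a header line."""
    out = io.StringIO()
    fields = ["locus", "accession", "location"] + list(wanted)
    writer = csv.DictWriter(out, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

To use it, save the efetch output (retmode=xml, as in the curl command above) to a file, pass its contents to `source_rows`, and write the result of `to_tsv` to disk. The resulting file should load in R with `read.delim`.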