Retrieving features from a GenBank file
7.6 years ago

I have downloaded a GenBank file from NCBI containing multiple sequences. I want to convert this file into a table (data.frame) with column headings such as LOCUS, ACCESSION, FEATURES, etc. Can somebody recommend a solution?

R sequence • 3.8k views

Can this tutorial help you?


This was a very useful tutorial, but I'm more interested in the metadata, e.g. isolation source, location, lat, long, country, date, etc.


I have written this Python script, which creates a .csv file that you can read into R as a data frame: https://github.com/dewshr/NCBI-Genbank-file-parser
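For the metadata asked about above (isolation source, country, lat/long, date), GenBank stores these as qualifiers of the `source` feature in each record. Here is a minimal sketch of how they could be pulled out with only the Python standard library. It is not taken from the linked script; the function names `parse_source_qualifiers` and `rows_to_csv` are made up for illustration, and it assumes the standard flat-file layout (feature keys indented 5 spaces, qualifiers indented 21 spaces, records ending with `//`).

```python
import csv
import io

QUALIFIER_INDENT = " " * 21  # qualifier lines start in column 22


def parse_source_qualifiers(text):
    """Return one dict per record: LOCUS, ACCESSION and the /source qualifiers."""
    rows = []
    for record in text.split("\n//"):
        record = record.lstrip()
        if not record.startswith("LOCUS"):
            continue
        row, in_features, in_source, last_key = {}, False, False, None
        for line in record.splitlines():
            if not line.startswith(" "):
                # Top-level keyword such as LOCUS, ACCESSION, FEATURES, ORIGIN.
                fields = line.split()
                keyword = fields[0] if fields else ""
                in_features = keyword == "FEATURES"
                in_source, last_key = False, None
                if keyword in ("LOCUS", "ACCESSION") and len(fields) > 1:
                    row[keyword] = fields[1]
            elif not in_features:
                continue  # header continuation or sequence lines
            elif line.startswith(QUALIFIER_INDENT):
                if not in_source:
                    continue
                body = line.strip()
                if body.startswith("/"):
                    # New qualifier; a repeated key overwrites the earlier value.
                    key, _, value = body[1:].partition("=")
                    row[key] = value.strip('"')
                    last_key = key
                elif last_key is not None:
                    # Wrapped value: join with a space (fine for free text).
                    row[last_key] += " " + body.strip('"')
            else:
                # A feature key line, e.g. "     source          1..20".
                fields = line.split()
                in_source = bool(fields) and fields[0] == "source"
                last_key = None
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the dicts as CSV text; columns are the union of all keys."""
    columns = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Reading the downloaded file as text, passing it to `parse_source_qualifiers`, and writing the result of `rows_to_csv` to disk should give a file that `read.csv` in R can load as a data.frame. Records that lack a given qualifier get an empty cell in that column.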

7.6 years ago

Create a tab-delimited table using XSLT, e.g.:


<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.1">
<xsl:output method="text" encoding="UTF-8"/>


<xsl:template match="/">

<xsl:apply-templates select="GBSet"/>
</xsl:template>


<xsl:template match="GBSet">
<xsl:apply-templates select="GBSeq"/>
</xsl:template>

<xsl:template match="GBSeq">
<xsl:for-each select="GBSeq_feature-table/GBFeature">
<xsl:value-of select="../../GBSeq_locus"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="../../GBSeq_primary-accession"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="GBFeature_key"/>
<xsl:text>&#9;</xsl:text>
<xsl:value-of select="GBFeature_location"/>
<xsl:text>
</xsl:text>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>

e.g.:

$ curl -s "https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NC_001664.2,AE001273&retmode=xml" | xsltproc --novalid transform.xsl -

NC_001664   NC_001664   source  1..159322
NC_001664   NC_001664   repeat_region   1..8088
NC_001664   NC_001664   repeat_region   56..342
NC_001664   NC_001664   gene    501..6850
NC_001664   NC_001664   CDS join(501..759,843..2653)
NC_001664   NC_001664   gene    4725..6850
NC_001664   NC_001664   CDS join(4725..5028,5837..6720)
NC_001664   NC_001664   regulatory  6845..6850
NC_001664   NC_001664   repeat_region   7655..8008
NC_001664   NC_001664   misc_feature    8009..151234
(...)
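
If xsltproc is not available, the same table can be built from the efetch XML in Python. The following is a rough sketch using the standard library's `xml.etree.ElementTree`; it reads the same elements as the stylesheet above (`GBSeq_locus`, `GBSeq_primary-accession`, `GBFeature_key`, `GBFeature_location`). The function names `feature_rows` and `rows_to_tsv` are hypothetical, and the XML is assumed to be already downloaded into a string.

```python
import xml.etree.ElementTree as ET


def feature_rows(xml_text):
    """Return (locus, accession, feature key, location) for every feature."""
    root = ET.fromstring(xml_text)
    rows = []
    # iter() also matches the root itself, so a lone GBSeq document works too.
    for seq in root.iter("GBSeq"):
        locus = seq.findtext("GBSeq_locus", default="")
        accession = seq.findtext("GBSeq_primary-accession", default="")
        for feature in seq.iterfind("GBSeq_feature-table/GBFeature"):
            rows.append((
                locus,
                accession,
                feature.findtext("GBFeature_key", default=""),
                feature.findtext("GBFeature_location", default=""),
            ))
    return rows


def rows_to_tsv(rows):
    """Join each row with tabs, one line per feature."""
    return "".join("\t".join(row) + "\n" for row in rows)
```

The tab-separated text from `rows_to_tsv` can be written to a file and loaded in R with `read.delim(..., header = FALSE)`.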

hum.. biostars-engine messed-up the XML code...
