Reading Genbank files
2
0
Entering edit mode
8.2 years ago
torkel.loman ▴ 10

Background: Given a refseq genome at NCBI what I'm trying to do is to extract the position of certain genes. In NCBI (example: https://www.ncbi.nlm.nih.gov/nuccore/AE001273.1?report=genbank&log$=seqview) there's first some information of the genome, then there's a list of features. These are what I'm interested in. You have entries like these:

    gene                 854128..855677
                         /gene="16SrRNA_1"
                         /locus_tag="CT_r01"
    rRNA                 854128..855677
                         /gene="16SrRNA_1"
                         /locus_tag="CT_r01"
                         /product="16S ribosomal RNA"

Here what I'm interested in is that from positions 854128 to 855677 there is a 16SrRNA. I want to download this information (the genebank, .gb file, right?). Next I want an iterator over all entries. I want to iterate over them, and if the entry are of a type which I'm looking (above: gene="16SrRNA...) for I want to record the positions (above: 854128 and 855677). I also want the full corresponding genome (I'm only thinking of prokaryot genomes).

Now I could probably write a program for this, but it feels like something where there should be tools to perform the task, and if such exists it would be better to use them.

Is there such tools (I use the latest version of Java, an API for it would be useful). I found someone who wrote you could use BioJava 1.8. But I can only find instructions and downloads for the latest version 4.x (as well as some earlier 3.x). I've found some documentation on BioJava 4.2 and mentioning on GenBank files, but no instruction on how to do anything (especially like what I'm looking for). Anyone who know anything?

genbank java genome gene • 3.0k views
ADD COMMENT
0
Entering edit mode

You could use eutils straight from NCBI.

ADD REPLY
0
Entering edit mode
8.2 years ago

Generate a parser with xjc :

$ xjc -dtd "https://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd"
parsing a schema...
compiling a schema...
generated/INSDAltSeqData.java
generated/INSDAltSeqDataItems.java
generated/INSDAltSeqItem.java
generated/INSDAltSeqItemInterval.java
generated/INSDAltSeqItemIsgap.java
generated/INSDAuthor.java
generated/INSDComment.java
generated/INSDCommentParagraph.java
generated/INSDCommentParagraphs.java
generated/INSDFeature.java
generated/INSDFeatureIntervals.java
generated/INSDFeaturePartial3.java
generated/INSDFeaturePartial5.java
generated/INSDFeatureQuals.java
generated/INSDFeatureSet.java
generated/INSDFeatureSetFeatures.java
generated/INSDFeatureXrefs.java
generated/INSDInterval.java
generated/INSDIntervalInterbp.java
generated/INSDIntervalIscomp.java
generated/INSDKeyword.java
generated/INSDQualifier.java
generated/INSDReference.java
generated/INSDReferenceAuthors.java
generated/INSDReferenceXref.java
generated/INSDSecondaryAccn.java
generated/INSDSeq.java
generated/INSDSeqAltSeq.java
generated/INSDSeqCommentSet.java
generated/INSDSeqFeatureSet.java
generated/INSDSeqFeatureTable.java
generated/INSDSeqKeywords.java
generated/INSDSeqOtherSeqids.java
generated/INSDSeqReferences.java
generated/INSDSeqSecondaryAccessions.java
generated/INSDSeqStrucComments.java
generated/INSDSeqXrefs.java
generated/INSDSeqid.java
generated/INSDSet.java
generated/INSDStrucComment.java
generated/INSDStrucCommentItem.java
generated/INSDStrucCommentItems.java
generated/INSDXref.java
generated/ObjectFactory.java

now you can create a streaming parser scanning the output of eutils: https://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AE001273.1&retmode=xml

see How To Get A Dbsnp Data From Java? for an example.

ADD COMMENT
0
Entering edit mode

Thanks for the answer. There seems no easy, already made function for oingt his. Like you suggested I wrote my own parser.

ADD REPLY
0
Entering edit mode
8.1 years ago

You could use biojava and a series of iterators. I would do this to get the features of a "CDS" coding sequence, so it should work for " gene" or "rRNA" too.

LinkedHashMap<String, DNASequence> codingSequences = GenbankReaderHelper.readGenbankDNASequence(file);
for (DNASequence cds : codingSequences.values()) {
    for (FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound> cdsFeature : cds.getFeatures()) {
        if (cdsFeature.getType().equals("gene")){   
            Map<String, List<Qualifier>> qualifiers = cdsFeature.getQualifiers();
            List<Qualifier> geneNameList = qualifiers.get("gene");      
            if (geneNameList != null) {
                for (Qualifier geneName : geneNameList){
                    if (geneName.getValue() == "16SrRNA"){
                        int loc1 = cdsFeature.getLocations().getStart().getPosition();
                        int loc2 = cdsFeature.getLocations().getEnd().getPosition();
                       }
                    }
                }
            }
        }
    }
}

I would also suggest to use a debugger for example like a eclipse - make a break point in one of these qualifiers to dig into exactly what you need and what each variable is called in BioJava.

I hope this helps

ADD COMMENT

Login before adding your answer.

Traffic: 1352 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6