Background:
Given a refseq genome at NCBI what I'm trying to do is to extract the position of certain genes. In NCBI (example: https://www.ncbi.nlm.nih.gov/nuccore/AE001273.1?report=genbank&log$=seqview) there's first some information of the genome, then there's a list of features. These are what I'm interested in. You have entries like these:
Here what I'm interested in is that from positions 854128 to 855677 there is a 16SrRNA. I want to download this information (the genebank, .gb file, right?). Next I want an iterator over all entries. I want to iterate over them, and if the entry are of a type which I'm looking (above: gene="16SrRNA...) for I want to record the positions (above: 854128 and 855677). I also want the full corresponding genome (I'm only thinking of prokaryot genomes).
Now I could probably write a program for this, but it feels like something where there should be tools to perform the task, and if such exists it would be better to use them.
Is there such tools (I use the latest version of Java, an API for it would be useful). I found someone who wrote you could use BioJava 1.8. But I can only find instructions and downloads for the latest version 4.x (as well as some earlier 3.x). I've found some documentation on BioJava 4.2 and mentioning on GenBank files, but no instruction on how to do anything (especially like what I'm looking for). Anyone who know anything?
You could use biojava and a series of iterators.
I would do this to get the features of a "CDS" coding sequence, so it should work for " gene" or "rRNA" too.
LinkedHashMap<String, DNASequence> codingSequences = GenbankReaderHelper.readGenbankDNASequence(file);
for (DNASequence cds : codingSequences.values()) {
for (FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound> cdsFeature : cds.getFeatures()) {
if (cdsFeature.getType().equals("gene")){
Map<String, List<Qualifier>> qualifiers = cdsFeature.getQualifiers();
List<Qualifier> geneNameList = qualifiers.get("gene");
if (geneNameList != null) {
for (Qualifier geneName : geneNameList){
if (geneName.getValue() == "16SrRNA"){
int loc1 = cdsFeature.getLocations().getStart().getPosition();
int loc2 = cdsFeature.getLocations().getEnd().getPosition();
}
}
}
}
}
}
}
I would also suggest to use a debugger for example like a eclipse - make a break point in one of these qualifiers to dig into exactly what you need and what each variable is called in BioJava.
You could use eutils straight from NCBI.