I would just dump all the *.gbk files in one folder, whether they come from NCBI or are annotated metagenome contigs from software such as PROKKA (http://www.vicbioinformatics.com/software.prokka.shtml), and build a simpler tabulated structure to work with (the first column is the LOCUS and all other columns contain the keywords) using the one-liners below.

(See my page http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/annotation.html for how I use EC numbers with a KEGG-style API.)

Say I start with some random gbk files downloaded from NCBI:

[uzi@quince-srv2 ~/oneliners/GBK_files]$ ls
NC_013164.gbk NC_014259.gbk NC_017100.gbk NC_021150.gbk NC_021285.gbk

In all the one-liners below,
to see the full output, remove: | perl -alne '{print join("\t",@F[0..5])}'
to store the output in a file, replace it with: > table.tsv
to store it in a file with unique keywords, replace it with: | perl -MList::MoreUtils=uniq -alne '{print join("\t",uniq @F)}' > table.tsv

Note that:
-If a genome/contig doesn't contain the feature, awk 'NF>1{print $0}' excludes it from the output
-The perl command with @F[0..5] is not required; it only truncates each row to the first six columns so the output structure is visible without cluttering this demonstration
-gensub() is a gawk extension, so awk here needs to be GNU awk
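
The same pipeline can be wrapped in one shell function so that the qualifier becomes an argument. This is only a sketch: gbk_table is a made-up name, it uses plain POSIX awk (index/substr) rather than gawk's gensub, and it assumes each qualifier value sits on a single line, so a value wrapped over several lines would be cut at the line end.

```shell
# gbk_table KEY FILE...
# KEY is the literal text in front of the value, e.g. '/EC_number="'
# or '/db_xref="taxon:'. Prints one row per LOCUS: the locus name, then
# every value found for KEY, tab-separated. Loci without any hit are
# skipped, like awk 'NF>1' in the one-liners.
gbk_table() {
    key=$1; shift
    awk -v key="$key" '
        function emit() { if (n > 0) print row }
        /^LOCUS/ { emit(); row = $2; n = 0; next }
        {
            i = index($0, key)                 # literal match, no regex escaping
            if (i == 0 || row == "") next
            val = substr($0, i + length(key))  # text after the key
            sub(/".*/, "", val)                # drop the closing quote and the rest
            row = row "\t" val
            n++
        }
        END { emit() }
    ' "$@"
}

# Usage (not run here):
#   gbk_table '/EC_number="' *.gbk > table.tsv
#   gbk_table '/db_xref="taxon:' *.gbk
```

Because values are joined with tabs directly, a value that contains spaces stays in one column, which the tr-based version does not guarantee once perl -a re-splits on whitespace.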

Extracting /db_xref="taxon:*"

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/taxon/{print gensub(" +/db_xref=\"taxon:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 525919
NC_014259 436717
NC_017100 634453
NC_021150 1283331
NC_021285 1167634

Extracting /EC_number="*":

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/EC_number/{print gensub(" +/EC_number=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 2.1.1.37 2.7.7.49
NC_014259 6.1.1.1 4.2.1.9 4.1.1.31 1.3.1.26 1.1.1.274
NC_021285 3.1.4.- 3.6.3.25 2.4.1.- 2.7.7.7 2.3.1.1

Extracting /db_xref="GeneID:*"

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/GeneID/{print gensub(" +/db_xref=\"GeneID:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 8368769 8368769 8368657 8368657 8368658
NC_014259 9384321 9384321 9380389 9380389 9380390
NC_017100 12066596 12066596 12066598 12066598 12066600
NC_021150 15372752 15372752 15367888 15367888 15370905
NC_021285 16467804 16467804 16467805 16467805 16467806

Extracting /db_xref="GI:*"

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/\/db_xref=\"GI/{print gensub(" +/db_xref=\"GI:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 256821124 256821125 256821126 256821127 256821128
NC_014259 299768251 299768252 299768253 299768254 299768255
NC_017100 384049542 384049543 384049544 384049545 384049546
NC_021150 482897842 482899070 482898704 482898122 482899071
NC_021285 528981797 528981798 528981799 528981800 528981801

Extracting CDDs:

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/CDD/{print gensub(" +/db_xref=\"CDD:(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_013164 200366 100105 100105 234461 200370
NC_014259 221153 232941 99707 99707 99707
NC_017100 218785 238416 223655 238416 238416
NC_021150 221153 232941 99707 99707 99707
NC_021285 183285 239900 239900 239900 170049

Extracting /gene="*"

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/\/gene=/{print gensub(" +/gene=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -alne '{print join("\t",@F[0..5])}'
NC_014259 dnaA dnaA dnaA dnaA dnaA
NC_017100 rusA rusA rusA bcr/cflA bcr/cflA
NC_021150 dnaA dnaA dnaA dnaA dnaA

Extracting unique /gene="*"

[uzi@quince-srv2 ~/oneliners/GBK_files]$ cat *.gbk | awk '/\/gene=/{print gensub(" +/gene=\"(.*)\"","\\1","g")}/^LOCUS/{print ";"$2}' | tr '\n' '\t' | tr ';' '\n' | awk 'NF>1{print $0}' | perl -MList::MoreUtils=uniq -alne '{print join("\t",uniq @F)}'
NC_014259 dnaA anmK dkgB nusB ribH psd pyrE murG murC ddl lpxC aceE glmM actP hisH hisB ompR gltX mraW mraY aroK aroB gltB gltD rpoZ gmk ispH rpsP rimM trmD rplS truB glyS glyQ aroE ribA engB rpsJ rplC rplD rplW rplB rpsS rplV rpsC rplP rpsQ rplN rplX rplE rpsN rpsH rplF rplR rpsE rpmD rplO secY rpmJ rpsM rpsD rplQ rplM ksgA apaH secF secD tgt queA aspS trpA mscL gatB gatA thrH valS rplU rpmA sdhA sdhB sucA sucC greA glpD groES groEL dapF tolB ruvB ruvA dapD tmk rnc era obgE mnmA rpmE hemE ampC trpC trpD ubiA cca fadE rpsB tsf queF glyA ppnK thiG gpsA rpsU purH prmA rplI rpsR rpsF prfA miaA pepN fabG murB moaC fumC ispF pyrH frr lpxD fabZ recA ompL argC pyrG eno ispD aroD tynA antC aspA recR hscA rpsT rpsA cmk gidB pgk ruvC purT tauD paaB paaA cysS xseA benD trmB pyrB rlmL alaS rnhB adk ureE ureC ureA lysS metH dut aat rpsL coaD smpB hemA ipk rpmF bioD dcd metG upp argJ hisD hisG rho ihfA pheT pheS rplT rpmI infC thrS panC panB mazG rumA cysM prfC ilvH leuS engA hisS ispG ndk proA clpX clpP tig metX thyA rpmB rpmG gltP truA infA leuD rpmE2 rpsO infB nusA secG tpiA coaE uvrC fadB fadA rpoB rplL rplJ rplA rplK hemC nrdR def atpC argS fur guaA prpB lldD pgi rph ileS lspA dnaK trmE rpmH |
|
NC_017100 rusA bcr/cflA uvrA nifU recA modE/mopA mdoG mdoH recR hrcA exbB/tolQ exbD/tolR rodA ftsI mreD mreC mreB fusB marR degT suv3 nusG secE pfkB cysA cysW cysT araC acrB nodT luxR hlyD/fusE lysR ligB ttg2D djlA queA yjgF ttg2C uvrB impB clpS clpA chvG chvI grpE dnaK dnaJ duf6 fur rho corA phoB phoR oprO corC/hlyC eamA mazG htpG xre iojap aldC oprB ttg1D nifB/moaA dedA matE nhaAP fecCD sss mauE lysE pnuC badM/rrf2 crcB tetR hspA cobW uspA bipA glnBK lepA metW rhlE hppK rpoK era ompH fabZ yidC truA kpsF/gutQ spoU dnaB mscS radA gntR sua5 holC degP hrpB hsp33 recX tatAE ribF mutL ompA comA ahpD xerD secB tim44 mutS2 hicA hicB arsR hap nol1/nop2 dnaA cycH cycL dsbE cysK cycJ ccmC rmuC kdpE rpoB rpoC secY rpoA moxR emrB/qacA greA gatB rpoD yjgPQ minE minD minC carD ftsX ftsE hslV hslU sufE prmA uvrD/rep radC secF secD smc phoH miaB uvrC MoeD MoeE lexA moeA maf thiO thiS recG cobT cobS bolA hemK dnaC tatD osmC exsB/queC smpB purS terC moaC secG recF tolC nusB deoR recQ rpoE phnA apbA phzCF xdhC xdhB xdhA marR/hxlR czcA hlyD emrE asmA fecR/pupR copG rrf5 rrl5 rrs5 cbiD cbiG cobH cobG fis fdhD ftsJ sirA hlyA oppC oppB oppD oppF npd padR fcuA cydC cydD mgtC czcB/hlyD czcC rrf4 rrl4 rrs4 iclR lolD lolCE baf hu lon clpX clpP hesB sufS sufD sufC sufB hlyB/msbA mutS pmbA/tldD asnC mucR mviN plsX smpA/omlA envZ ompR rimM ffh/srp54 atp12 engA sod clpB hflX hfq ntrX ntrY ntrC ntrB nifR3 ispDF ubiHF groES groEL fkbH amtB xerC yajC apaG pfpI ribD obgE/cgtA mraZ mraW ftsW ftsQ ftsA ftsZ recN copA copB recO copZ lytB pirin glgX pcpB sco1/senC rrf3 rrl3 rrs3 dsbD tatC tatB scpB scpA pqqA pqqB pqqC pqqD pqqE aatC recJ rpoH fecE secB-2 pdxJ cobP cobD/cbiB cobQ cobOP cbiA cobN ycdH ureG ureF ureE ureD trbI trbG trbF trbL repA rrf2 rrl2 rrs2 xdhC/coxI truB cvpA ftsY phoU pstB pstA pstC apbC nusA ccmA ccmB thiC ftr1 mutY rrf1 rrl1 rrs1 ompW aadR/crp/fnr htrA ssb traR/dksA tldD ftsK nifS folC ctrA ftsH tilS tolB ruvB ruvA ruvC secA argJ hisA hisF parB parA gidB gidA trmE/thdF |
|
NC_021150 dnaA recF gyrB hdtS glyS glyQ trkA sun fmt def lysM dprA yrdC qor aroE pcoA pcoB cysC2 uvrD polA thrB gadh3 gadh1 gadh2 cyoE ctaD purE purK modA2 nifH nifD nifK nifT nifY lrv nifE nifN nifX fesI iscAnif nifU nifS nifV cysE1nif nifW nifZ nifM clpX nifF osmC pagL dsbA cycA engB zwf-4 hexR-3 pycA pycB trpI trpB trpA cbpA clpP aglA-2 vnfY vnfK vnfG vnfD vnfF vnfH pcaK moeB2 vnfX vnfN vnfE vnfA vnfU relA rpoZ gmk rph crc pyrE argB algC dut coaBC radC mtgA thiG trmB rdgB metW metX proC pilT pyrC pyrB pyrR gshB pilG bioA ilvD hipA coaD metF ahcY ligB fucP dctM dctQ dctP dctD2 dctB2 fdhD fdhA fdhH fdhI fdhE moaA mqo glk gluP hoxY hoxH hoxW cooJ cooT cooC cooS cooF yggX mutY hisB hisH hisA hisF pgm-1 secB argE argA tonB gshA bcsZ bcsD bcsC bcsAB ompR envZ hslO pckA cls nudE metK tktA-3 epd pgk fba cccA katG glcB dsbG bioB bioF bioH bioC birA coaX tuf secE nusG rplK rplA rplJ rplL rpoB rpoC rpsL rpsG fusA rpsJ rplC rplD rplW rplB rpsS rplV rpsC rplP rpmC rpsQ rplN rplX rplE rpsN rplR rpsE rpmD rplO secY rpmJ rpsM rpsD rpoA rplQ uvrA oprG-2 oprG-1 nrdR ribD ribE ribB ribH nusB thiL ribA retS purD purH fis prmA accC accB aroQ dipZ speA serB psd rhdA mutL miaA hfq hflX hflK hflC hisZ purA rnr rpsF rpsR rplI dnaB fpr2 tesB trmA bcpB btuB dxs-1 ispA-1 xseB-1 ppa aldA mpl ubiX tonB2 ftsH lrgA lrgB proA nadD rodA mltB dacA lipB lipA benR benA benB benC benD xylJ xylI xylH xylQ xylT xylE xylG xylK dmpI dmpB dmpP dmpO dmpN dmpM dpmL dmpK holA leuS lnt miaB rhaU hpaI aldB hemL thiE thiD ureG ureE dctA-2 dsbB leuD2 leuC2 ureC ureB ureA eutB eutC rluE hrpB cueR selD folX mdcA citG mdcC mdcD mdcG madL madM nhaA fixA fixB fixC fixX tonB3 algA algF algV algI algL algX algG algJ algK alg44 alg8 algD algB coxB coxA moaC moaD moaE rhlB estB apc4 apc3 apc2 apc1 valS pepA lptF lptG mviN ribF ileS lspA ispH fimT pilE pilV pilW pilY1 thiO comL rluD clpB ndh coaE pilD pilC pilB pilA nadC ampD ampE fruR fruB fruK fruA prfC speC lldD gatB gatA gatC mreB mreC mreD maf tldD tldE ptsO ptsN 
rpoN lptB lptA kpsF murA hisG hisD hisC algW cysD cysN trpS nfi rplM rpsI petA petB petC sspA sspB gmhA mraZ mraW ftsL ftsI murE murF mraY murD ftsW murG murC ddl ftsQ ftsA ftsZ lpxC secA argJ apbA ampG fabG pdxH fxsA groES groEL rsuA ung nadB mucA mucB mucC mucD lepA lepB rnc era alyA3 recO pdxJ mmsA lon putA metR metE argF bfr rnt pyrC2 argG mobA moaB moeA ech vdh fcs aat1 etfB1 eftA1 pobA pobR fadB fadA topA lexA psrA mfd gapB nqrA nqrB nqrC nqrD nqrE nqrF apbE sthA lolC lolD lolE comA exbB exbD lpxK kdsB ptpA murB rne rluC rpmF plsX fabD fabG2 acpP fabF pabC tmk holB pilZ mhpB mphC mhpR mhpD mhpF mphE mhpT etfB2 eftA2 pyrF efp ohr hmgA htpX fadE queF trkH ispZ scpA scpB edd-1 glk2 hexR-2 pgl-1 eda-2 gph ubiG mtnA gyrA serC pheA hisC2 aroA cmk rpsA ihfB rfaH rfbB rfbD rfbA rfbC gspD gspG2 gspE cysC1 eexD eexE eexF hrpA fadB2 etfB3 etfA3 zwf-3 pgi zwf-2 pgl-2 pgi2 colS colR ppdK dctB dctD pepN pyrD rlmL dacB htpG glxI rnfE rnfG rnfD rnfC rnfB rnfA metG apbC dcd dinG agaA estC pdxH2 shaAB shaC shaD shaF shaG livG livM dnaX recR cydB cydA cydR hemN ccoS ccoI ccoH ccoG ccoP ccoQ ccoO ccoN acnA tusA tyrB uvrB gltX thrS infC rpmI rplT pheS pheT ihfA dxs-2 ispA-2 xseB-2 idi acxA acxB acxC acxR rhdE cysE3 alc entA entF entB entE csbC csbX vbiC metC ssuD tauD nemA asfC asfB asfA torG rpiB talB-1 tktA-1 rpiA-1 pykA-2 eno-2 xdhC xdhB xdhA dszA gntT gntK-1 feoA feoB feoC exeA greB oprI aroG cysB cysH pabB gntR prpB prpC acnD prpF ppsA rraA oprF nasH nasB nasA narK atsA pfkB nasT nasS acnB lpxH glnS cysS folD tig clpP2 clpX2 pepS16 ppiD phbC phbA phbB phbR phbP phbF folE speE alyA2 flgN flgM flgA flgB flgC flgD flgE flgF flgG flgH flgI flgJ flgK flgL fliR fliQ fliP fliO fliN fliM fliL fliK fliJ fliI fliH fliG fliF fliE fliD fliS fliT flhX flhD flhC motA motB cheA cheW cheR kdpD kdpE kup kdsA glgX treY malQ treZ glgA galU gor gacA uvrC pgsA actP fpvI fpvR groEL2 gapA gcvP2 sdaA gcvT3 hutG hutI hutU phyH hutH hutF nifA2 lysE pvdE ptxS kguE kguK kguT kguD fagA fumC eno-1 pykA-1 
eda-1 zwf-1 hexR-1 edd-2 gntK-2 mtlY tktA-2 talB-2 mtlD mtlK mtlG mtlF mtlE mtlR pgm-2 ybhE cysI nudC motD motC fliC rfbC2 rfbG rfbF hicB flaG fliA flhE flhF flhA flhB cheZ cheY cheB glgB treS glgE cobA serS rarA lolA ftsK trxB aat infA clpA clpS cspD icd mnmA purB nuoA nuoB nuocd nuoE nuoF nuoG nuoH nuoI nuoJ nuoK nuoL nuoM nuoN gspG1 gspF nfuA xth ppiC sixA gpsA fabA fabB metH arsD arsC arsB arsA dnaQ rnhA fabI sucC lpdA sucB sucA sdhB sdhA sdhD sdhC gltA wbpO wbpP phaJ codA hepA ligA zipA smc moaA2 cycH ccmH ccmG ccmF ccmE ccmD ccmC ccmB ccmA fleN sodA ssuB ssuC ssuA ssuE cysP lapQ lapG lapF lapI lapH lapE lapC arsE lapP lapO lapN lapM lapL lapK lapB lapR ttdB ttdA dctA-3 dszA2 cas2 cas1 cas4 csd2 csd1 cas5d oprE cysT cysW cysA alyA1 gcvT2 hpaR talB-3 cyaB recQ cysK rnd cobS cobC cobT cobU cobQ cobD cbiA btuB2 aroF dctA-1 pcpS nrdA nrdB vnfA2 algE5 mexT mexE mexF ppnK metZ purF folC accD trpF truA asd leuB leuD leuC dusC fleQ algZ mgtE csrA lysC alaS argD acsA1 pta-1 ackA-1 phhA phhB ggpS hppD fumC2 mnmC asnB actP2 htrB minC minD minE rluA rdgC purC dapA nadA frdC trbG trbF trbL trbJ trbE trbD trbC trbB traF oprL tolB tolA tolR tolQ ybgC ruvB ruvA ruvC aspS proS pgsA2 purM purN relB relE1 mazG ilvM rumA cysM gacS dinP tmp lysB rimO pfpI cls2 sodB araJ aroQ2 pip hppD2 pcaC pcaD pcaB pcaF pcaJ pcaI pcaR pcaG pcaH pcaQ purU mvaT sbcB pykA-6 fumB fpr1 finR recX recA ibpB ibpA mutS fdxA aerP rpoS nlpD pcm surE truD ispF fghA ispD ftsB eno kdsA2 pyrG metJ accA dnaE rnhB lpxB lpxA fabZ lpxD ompH mucP dxr csdA uppS frr pyrH tsf rpsB map glnD dapC dapD thiF dapE rrmA cspA yeaZ adk ppc lysS prfB recJ yaeQ thrC hom dsbC xerD rplS trmD rimM rpsP ffh purT purL guaA xseA acoK gcd pvdH fpvB engA hisS ispG pilF ndk iscX fdx hscA hscB iscA iscU iscS iscR cysE2 trmH suhB secF secD yajC tgt queA rpsT proB obgE rpmA rplU gerC fklB fpvA glyA cobW cstA radA mscL ackA-2 pta-2 mqo2 hpt upp hemH phr murI prfA hemA lolB ipk pth ychF pqqF pqqE pqqD pqqC pqqB acoR acoA acoB acoC adh otsA 
otsB rimI leuA pqiA pssA ilvC ilvH ilvI mrcB hmuV sfsA aspT dksA cbrA cbrB pcnB folK panB panC panD pgi3 acsA2 pnp rpsO truB rbfA infB nusA secG glmM folP ftsH1 rrmJ greA carB carA dapB dnaJ dnaK grpE recN fur omlA smpB lctP mosA mosB maeB2 lldp glcB2 glcG glcF glcE glcD glcC eno-3 pykA-3 cdaR yhaD gcl garR pirA parC parE sfnG seuB ribAB livH seuA metB msuD sfnR pta-3 ackA-3 thiC rfaE msbA galE waaP waaG ilvE glnE aceE aceF msrA relE2 hemE gltD gltB aroB aroK pilQ pilP pilO pilN pilM ponA maeB rpmE priA argS hslV hslU ubiE ubiB hisI hisE tatA tatB tatC mdoH2 mdoD mdoH1 mdoG dtd fbp glpF glpK glpR glpD eno-4 tpiA-2 typA thiI glnA ntrB ntrC ahpC bioD tyrS anmK erpA argC coq7 speD crp trpC trpD trpGD pdhB pdhA cadR pncA mdaB ada hpaI2 trpE gph2 rpe gabT lptD pdxA ksgA apaG apaH glpE prkA cca folK1 folB gcp rpsU dnaG rpoD cooA katN vnfA3 fecA gcvP1 gcvH gcvT ubiF ubiH ubiD rho trxA ppx ppk hemB feoB2 algQ dsbB2 hemD hemC algR argH corA cyaY lppL lysA dapF xerC amtB glnK pirR rep xpt cycB dadX dadA lrp aldH rpmG rpmB rpoH ftsX ftsE ftsY glpQ thyA lgt ptsP nudH ilvA rpiA-2 serA recG muc26 hupA ubiC ubiA phoB phoR phoU pstB pstA pstC pstS potI potH potG potF spuC spuB spuA anfR anfO anfK anfG anfD anfH anfA anfU gabD algY tctE tctD tctC tctB tctA fic tpiA-3 tpiA-1 eno-5 pykA-5 iolC iolE iolB iolT iolA iolD idh iolH hppD3 fahA maiA nafY rnfH rnfE1 rnfG1 rnfD1 rnfC1 rnfB1 rnfA1 nifL nifA nifB nifO nifQ norR hmp algE3 algE2 algE1 algE4 algE6 algE7 dgoT galD dgoA dgoK dgoR pykA-4 melA aglA-1 scrY-2 hbdH bktA lpxO recD recB recC nhaA2 scrY-1 lacY scrB scrR glmS glmR glmU atpC atpD atpG atpA atpH atpF atpE atpB atpI parB parA gidB gidA trmE rnpA rpmH |
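
If List::MoreUtils is not installed, the unique-keyword step can be approximated in awk. A minimal sketch, assuming tab-separated rows on stdin; uniq_fields is a hypothetical helper name, not part of the one-liners above:

```shell
# uniq_fields: read tab-separated rows on stdin and drop repeated values
# within each row, keeping the first occurrence of each in its original
# order. Empty fields (e.g. from a trailing tab) are dropped too.
uniq_fields() {
    awk -F'\t' '
        {
            split("", seen)              # portable way to empty the array
            out = $1
            seen[$1] = 1
            for (i = 2; i <= NF; i++)
                if ($i != "" && !($i in seen)) {
                    seen[$i] = 1
                    out = out "\t" $i
                }
            print out
        }
    '
}

# Usage (not run here): put it where the perl uniq step goes, e.g.
#   ... | awk 'NF>1{print $0}' | uniq_fields > table.tsv
```

Unlike perl -a, which splits on any whitespace, this splits on tabs only, so a keyword containing a space is treated as one value.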
Thanks for that thought -- I agree the /plasmid tag can be useful. By an organism's genome, I mean all the DNA carried by an organism: plasmids, chromosomes, and prophages. So I want to match all of these gbk files to the organism (host, if you prefer) from which they were sequenced. Separately, I agree that it is sometimes unclear from the source tags which record is a chromosome (or 'complete genome').
So what is the problem then? Each folder contains only the information for a single strain.
The "DSM 16854 = C23" entry under /strain is simply saying that they're equivalent names for the same strain. C23 is the name given to the strain, DSM16854 is an identifier given to strain C23 by the DSMZ Bacteria Collection. They both point to the same thing: https://www.dsmz.de/catalogues/details/culture/DSM-16854.html
Nope, that's what I had hoped (and then it would be simple), but folders sometimes contain multiple strains, or even multiple unrelated organisms, such as:
ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Vibrio_parahaemolyticus_O1_K33_CDC_K4557_uid212977
and yes, in the example it is easy enough to read the differing strain info for the chromosome and plasmid, and understand they are the same, but getting code to do that is not so easy...
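
One rough way to get code to see that two records name the same strain is to split each /strain value into its aliases and look for any alias in common. This is a sketch under loud assumptions: synonyms are joined with "=" as in "DSM 16854 = C23", the value sits on one line, and stripping everything except letters and digits is an acceptable normalisation (it could merge names that are really distinct). strain_aliases and shares_strain are hypothetical names.

```shell
# strain_aliases FILE...
# Prints "LOCUS<TAB>ALIAS" for each name found in /strain="...".
# Aliases are upper-cased with non-alphanumerics removed, so
# "DSM 16854" and "DSM-16854" both become DSM16854.
strain_aliases() {
    awk '
        /^LOCUS/ { locus = $2 }
        {
            key = "/strain=\""
            i = index($0, key)
            if (i == 0) next
            v = substr($0, i + length(key))
            sub(/".*/, "", v)              # drop the closing quote
            n = split(v, part, "=")        # assumed synonym separator
            for (k = 1; k <= n; k++) {
                a = toupper(part[k])
                gsub(/[^A-Z0-9]/, "", a)
                if (a != "") print locus "\t" a
            }
        }
    ' "$@"
}

# shares_strain FILE_A FILE_B
# Exit status 0 if the two files have at least one alias in common.
shares_strain() {
    {
        strain_aliases "$1" | cut -f2 | sort -u
        strain_aliases "$2" | cut -f2 | sort -u
    } | sort | uniq -d | grep -q .
}
```

It only catches names that differ by case, spacing or punctuation, or that are listed as synonyms in at least one of the files; it does nothing for strains that are annotated under unrelated names.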
You should be able to use taxid to handle the case of multiple species in a single directory. If the taxid for a file fails to match that of the directory species, toss it. I guess you could limit to one genome per folder. Parsing by organism name is probably more work than it's worth.
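
The taxid check could look roughly like this. It is a sketch, not a tested filter for the NCBI ftp layout: keep_matching_taxid is a hypothetical name, the expected taxid has to be supplied by you (where it comes from is outside this sketch), and only the first /db_xref="taxon:..." qualifier in each file is compared.

```shell
# keep_matching_taxid TAXID FILE...
# Prints the name of every file whose first /db_xref="taxon:..." value
# equals TAXID; files with another taxid, or none, are silently skipped.
keep_matching_taxid() {
    want=$1; shift
    for f in "$@"; do
        got=$(awk -v key='/db_xref="taxon:' '
            { i = index($0, key) }
            i > 0 {
                v = substr($0, i + length(key))
                sub(/".*/, "", v)          # drop the closing quote
                print v
                exit                       # first taxon qualifier only
            }
        ' "$f")
        if [ "$got" = "$want" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage (not run here), with a taxid from the table above:
#   keep_matching_taxid 525919 *.gbk
```

This only separates records that carry different taxids; strains that share one taxid still come through together.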
For handling strain naming problems, there's no obvious option other than downloading from all the databases and making a lookup table. Even this might not give full confidence, since it is clear that the strain annotations are inconsistently implemented.
I'm not sure that there is a single point to filter on that will provide you with what you need. I've never had a good experience filtering .gbk files, especially when pulling them from the bacteria ftp (it was a few years ago the last time I did). I've never been fully confident of whatever filtering kludge I worked up; I usually end up reading through the data set just to be sure. Unless you're writing software you plan to distribute, or you need frequent updates, I'd wager that you will save time by just manually checking the files.
Yeah, similar history here. One kludge after another. I find it disheartening that NCBI has given up on subspecies taxids, as there is no controlled vocabulary for these. I could toss files, but my goal is to parse all available prok genomes, and that Vibrio example may be a harbinger of what is to come -- a mix of species and strains all within one bioproject directory. I'm still hoping someone will have a solution we haven't thought of ...
http://jgi.doe.gov/ has done a much better job with sequence annotation and curation, so you might want to check there instead. However, they do seem to be having some database issues currently.
Thanks. I've used them in the past, but need to be more current (want to have at least 95% of completed genomes already deposited in NCBI). I'll check there again though.