5.6 years ago
pecunarg
Hi,
I am trying to build a dataset from the NCBI Nucleotide database. My query returned 1124 results.
From each result I want to extract only the items related to accession, country, and isolation source.
Here is what I got from NCBI for one of the records:
LOCUS MH973850 410 bp DNA linear PLN 03-MAR-2019
DEFINITION Cryptococcus neoformans isolate OA2 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence.
ACCESSION MH973850
VERSION MH973850.1
KEYWORDS .
SOURCE Cryptococcus neoformans
ORGANISM Cryptococcus neoformans
Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
Tremellomycetes; Tremellales; Cryptococcaceae; Cryptococcus;
Cryptococcus neoformans species complex.
REFERENCE 1 (bases 1 to 410)
AUTHORS Abaci Gunyar,O., Yoltas,A., Haliki Uztan,A. and Yamac,M.
TITLE Isolation and Identification of Cryptococcus neoformans from the
soil samples taken from inside and outside of Nigde Duzkir (=
Aladaglar) cave
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 410)
AUTHORS Abaci Gunyar,O., Yoltas,A., Haliki Uztan,A. and Yamac,M.
TITLE Direct Submission
JOURNAL Submitted (24-SEP-2018) Biology, Ege University, Genclik Caddesi,
Izmir 35040, Turkiye
COMMENT ##Assembly-Data-START##
Sequencing Technology :: Sanger dideoxy sequencing
##Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..410
/organism="Cryptococcus neoformans"
/mol_type="genomic DNA"
/isolate="OA2"
/isolation_source="Soil sample"
/db_xref="taxon:5207"
misc_RNA <1..>410
/note="contains internal transcribed spacer 1, 5.8S
ribosomal RNA, and internal transcribed spacer 2"
ORIGIN
1 aggatcagta gagaatattg gacttcggtc catttatcta cccatctaca cctgtgaact
61 gtttatgtgc ttcggcacgt tttacacaaa cttctaaatg taatgaatgt aatcttatta
121 taacaataat aaaactttca acaacggatc tcttggcttc cacatcgatg aagaacgcag
181 cgaaatgcga taagtaatgt gaattgcaga attcagtgaa tcatcgaatc tttgaacgca
241 acttgcgccc tttggtattc cgaagggcat gcctgtttga gagtcatgaa aatctcaatc
301 cctcgggttt tattacctgt tggacttgga tttgggtgtt tgccgcgacc tgcaaaggac
361 gtcggctcgc cttaaatgtg ttagtgggaa ggtgattacc tgtcagcccg
This is what I want to obtain from each record:
ACCESSION MH973850
REFERENCE 2 (bases 1 to 410)
JOURNAL Submitted (24-SEP-2018), Turkiye
FEATURES Location/Qualifiers
source 1..410
/isolation_source="Soil sample"
Which is the best way to accomplish this?
Technically this can be done with the Entrez Direct tools, by fetching the entire flat file in XML format with the efetch tool and parsing it with the xtract tool. But I'd expect the code to be a bit more readable with Biopython. What have you tried so far?

I downloaded the XML file for the query and tried to parse it by the specific child tags. The problem was that the downloaded XML file was 8 GB, so I got stuck at the very beginning because I was never able to load the file. Then I tried the Entrez Direct E-utilities, but I could never get past the first step.
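Since the 8 GB XML file was too large to load in one go, one alternative is to work from the GenBank flat-file format shown in the question and read it line by line, so that only one record is held in memory at a time. The following is a minimal sketch using only the Python standard library, not Biopython or Entrez Direct. The function name `parse_genbank`, the `WANTED` set, and the `SAMPLE` text are assumptions made up for illustration, and the parser relies on the standard flat-file indentation (continuation lines in the header start with 12 spaces). Newer records may, as far as I know, use `/geo_loc_name` instead of `/country`, so both are included.

```python
import io
import re

# Qualifiers to keep from the feature table. This set is an assumption
# based on the question; extend it as needed.
WANTED = {"country", "geo_loc_name", "isolation_source"}

# Matches the first line of a qualifier such as /isolation_source="Soil sample"
QUALIFIER = re.compile(r'^/(\w+)="?(.*)$')


def parse_genbank(handle):
    """Yield one dict per GenBank flat-file record, reading line by line."""
    record, current, in_features = {}, None, False
    for raw in handle:
        line = raw.strip()
        if raw.startswith("//"):  # end-of-record marker
            if record:
                yield record
            record, current, in_features = {}, None, False
        elif raw.startswith("ACCESSION"):
            parts = line.split()
            if len(parts) > 1:
                record["accession"] = parts[1]
            current = None
        elif raw.startswith("FEATURES"):
            in_features, current = True, None
        elif raw.startswith("ORIGIN"):  # sequence follows; ignore it
            in_features, current = False, None
        elif in_features:
            match = QUALIFIER.match(line)
            if current is not None:
                # Continuation of a wanted qualifier whose quote is still open
                record[current] += " " + line.rstrip('"')
                if line.endswith('"'):
                    current = None
            elif match and match.group(1) in WANTED:
                name, value = match.groups()
                record[name] = value.rstrip('"')
                current = None if value.endswith('"') else name
        elif raw.startswith("  JOURNAL"):
            text = line[len("JOURNAL"):].strip()
            if text.startswith("Submitted"):  # the direct-submission entry
                record["submitted"] = text
                current = "submitted"
            else:
                current = None
        elif current == "submitted" and raw.startswith(" " * 12):
            record["submitted"] += " " + line  # wrapped JOURNAL line
        else:
            current = None
    if record:  # file did not end with "//"
        yield record


# Shortened copy of the record from the question, for demonstration only.
SAMPLE = """\
LOCUS       MH973850                 410 bp    DNA     linear   PLN 03-MAR-2019
ACCESSION   MH973850
REFERENCE   2  (bases 1 to 410)
  TITLE     Direct Submission
  JOURNAL   Submitted (24-SEP-2018) Biology, Ege University, Genclik Caddesi,
            Izmir 35040, Turkiye
FEATURES             Location/Qualifiers
     source          1..410
                     /organism="Cryptococcus neoformans"
                     /isolation_source="Soil sample"
ORIGIN
        1 aggatcagta gagaatattg
//
"""

records = list(parse_genbank(io.StringIO(SAMPLE)))
for rec in records:
    print(rec.get("accession"), rec.get("country"),
          rec.get("isolation_source"), rec.get("submitted"), sep="\t")
```

For a real download, the `io.StringIO(SAMPLE)` handle would be replaced by an open file handle. Note that the example record has no `/country` qualifier at all, so that field comes back empty and the country would have to be read from the submission address instead. If Biopython is available, `SeqIO.parse(handle, "genbank")` also iterates over records one at a time and exposes the same qualifiers through each feature's `qualifiers` dictionary.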