Question

What is the best way to extract only selected fields from NCBI GenBank records?

0

5.6 years ago

pecunarg • 0

Hi,

I am trying to build a database from the NCBI Nucleotide database. My query returned 1124 results. From each result I want to extract only the items related to accession, country and isolation source.

Here is what I got from NCBI for one record:

LOCUS       MH973850                 410 bp    DNA     linear   PLN 03-MAR-2019
DEFINITION  Cryptococcus neoformans isolate OA2 internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence.
ACCESSION   MH973850
VERSION     MH973850.1
KEYWORDS    .
SOURCE      Cryptococcus neoformans
  ORGANISM  Cryptococcus neoformans
            Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
            Tremellomycetes; Tremellales; Cryptococcaceae; Cryptococcus;
            Cryptococcus neoformans species complex.
REFERENCE   1  (bases 1 to 410)
  AUTHORS   Abaci Gunyar,O., Yoltas,A., Haliki Uztan,A. and Yamac,M.
  TITLE     Isolation and Identification of Cryptococcus neoformans from the
            soil samples taken from inside and outside of Nigde Duzkir (=
            Aladaglar) cave
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 410)
  AUTHORS   Abaci Gunyar,O., Yoltas,A., Haliki Uztan,A. and Yamac,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (24-SEP-2018) Biology, Ege University, Genclik Caddesi,
            Izmir 35040, Turkiye
COMMENT     ##Assembly-Data-START##
            Sequencing Technology :: Sanger dideoxy sequencing
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..410
                     /organism="Cryptococcus neoformans"
                     /mol_type="genomic DNA"
                     /isolate="OA2"
                     /isolation_source="Soil sample"
                     /db_xref="taxon:5207"
     misc_RNA        <1..>410
                     /note="contains internal transcribed spacer 1, 5.8S
                     ribosomal RNA, and internal transcribed spacer 2"
ORIGIN      
    1 aggatcagta gagaatattg gacttcggtc catttatcta cccatctaca cctgtgaact
   61 gtttatgtgc ttcggcacgt tttacacaaa cttctaaatg taatgaatgt aatcttatta
  121 taacaataat aaaactttca acaacggatc tcttggcttc cacatcgatg aagaacgcag
  181 cgaaatgcga taagtaatgt gaattgcaga attcagtgaa tcatcgaatc tttgaacgca
  241 acttgcgccc tttggtattc cgaagggcat gcctgtttga gagtcatgaa aatctcaatc
  301 cctcgggttt tattacctgt tggacttgga tttgggtgtt tgccgcgacc tgcaaaggac
  361 gtcggctcgc cttaaatgtg ttagtgggaa ggtgattacc tgtcagcccg

What I want to obtain from each record:

ACCESSION   MH973850
REFERENCE   2  (bases 1 to 410)
 JOURNAL   Submitted (24-SEP-2018), Turkiye
FEATURES             Location/Qualifiers
 source          1..410
                 /isolation_source="Soil sample"

Which is the best way to accomplish this?

genbank r python • 1.0k views

updated 3.7 years ago by Biostar 20 • written 5.6 years ago by pecunarg • 0

0

Technically this can be done with the Entrez Direct tools: fetch the entire flat file in XML format with efetch and parse it with xtract. But I'd expect the code to be a bit more readable with Biopython. What have you tried so far?
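To make the target concrete, here is a minimal standard-library sketch of the extraction itself, assuming the records have already been saved as a GenBank flat file in which each record ends with a `//` line. The names `iter_records`, `summarize`, `WANTED` and `SAMPLE` are made up for illustration, and the parsing is deliberately naive: it only reads single-line qualifiers and treats lines indented by five or more spaces as continuations of the submission `JOURNAL` line. A real parser such as Biopython's `SeqIO.parse(handle, "genbank")` would likely be more robust.

```python
import io
import re

# Matches a single-line feature qualifier such as /isolation_source="Soil sample".
# Qualifier values that wrap onto a second line are NOT handled by this sketch.
QUALIFIER = re.compile(r'^\s*/(\w+)="([^"]*)"\s*$')

# Qualifiers to keep (assumption: some newer records may use /geo_loc_name
# where older ones use /country).
WANTED = {"isolation_source", "country", "geo_loc_name"}


def iter_records(handle):
    """Yield one record at a time as a list of lines; '//' ends a record."""
    record = []
    for line in handle:
        if line.strip() == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line.rstrip("\n"))
    if any(l.strip() for l in record):
        yield record  # last record without a closing '//'


def summarize(record):
    """Return the accession, submission line and wanted qualifiers of one record."""
    info = {"accession": None, "submitted": None}
    in_submission = False
    for line in record:
        text = line.strip()
        indent = len(line) - len(line.lstrip())
        if in_submission and indent >= 5:
            # Deeply indented line right after the JOURNAL line: continuation.
            info["submitted"] += " " + text
            continue
        in_submission = False
        if text.startswith("ACCESSION"):
            info["accession"] = text.split()[1]
        elif text.startswith("JOURNAL") and "Submitted" in text:
            info["submitted"] = text.split(None, 1)[1]
            in_submission = True
        else:
            match = QUALIFIER.match(line)
            if match and match.group(1) in WANTED:
                info[match.group(1)] = match.group(2)
    return info


# Excerpt of the record from the question, used only as demo input.
SAMPLE = '''\
ACCESSION   MH973850
REFERENCE   2  (bases 1 to 410)
  JOURNAL   Submitted (24-SEP-2018) Biology, Ege University, Genclik Caddesi,
            Izmir 35040, Turkiye
COMMENT     ##Assembly-Data-START##
FEATURES             Location/Qualifiers
     source          1..410
                     /isolate="OA2"
                     /isolation_source="Soil sample"
//
'''

for rec in iter_records(io.StringIO(SAMPLE)):
    print(summarize(rec))
```

Because `iter_records` only ever holds one record in memory, the same loop should work on an open file handle with all 1124 records. Note that this particular record has no country qualifier, so the country is only available as the last part of the submission address.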

5.6 years ago by vkkodali_ncbi ★ 3.8k

0

So far I have downloaded the XML file for the query and tried to parse it by the specific child tags. The problem was that the downloaded XML file was 8 GB, so I got stuck at the very beginning because I could never load it. Then I tried the Entrez Direct E-utilities, but I could never get past the first step.
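For the size problem specifically, a streaming parse may avoid loading the whole file at once. The sketch below uses `xml.etree.ElementTree.iterparse` from the standard library and assumes the download is in the GBSeq XML layout (a `GBSet` root with one `GBSeq` element per record); if the file uses a different schema, the tag names would need to change. The function name `stream_gbseq` and the `SAMPLE_XML` snippet are illustrative only.

```python
import io
import xml.etree.ElementTree as ET


def stream_gbseq(source, wanted=("isolation_source", "country")):
    """Yield one small dict per <GBSeq> record while reading the file incrementally.

    `source` may be a filename or a binary file object.
    Assumes GBSeq-style tag names; adjust them for other XML schemas.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "GBSeq":
            continue  # wait until a whole record has been read
        row = {"accession": elem.findtext("GBSeq_primary-accession")}
        # Scans the qualifiers of every feature in the record; the wanted
        # names are expected to occur on the source feature.
        for qual in elem.iter("GBQualifier"):
            name = qual.findtext("GBQualifier_name")
            if name in wanted:
                row[name] = qual.findtext("GBQualifier_value")
        yield row
        elem.clear()  # drop the record's children (sequence included) to save memory


# Tiny hand-written stand-in for the real download, used only as demo input.
SAMPLE_XML = b"""<GBSet>
  <GBSeq>
    <GBSeq_primary-accession>MH973850</GBSeq_primary-accession>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>source</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>isolation_source</GBQualifier_name>
            <GBQualifier_value>Soil sample</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""

for row in stream_gbseq(io.BytesIO(SAMPLE_XML)):
    print(row)
```

With the real file, passing its path instead of the `BytesIO` object should behave the same way, and each yielded row could be written straight to a CSV file or database table.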

5.6 years ago by pecunarg • 0