Biopython download nucleotide records without sequences (or skip huge sequences)
3.0 years ago
abu • 0

I am trying to download information from the NCBI Entrez nucleotide database using the Biopython package.

I don't need molecular data at all. I just want to check the textual information about certain records, to see references, authors, journals, and information about voucher specimens from which the genome sample was extracted.

My query returns a lot of records, and I need to check whether each one is related to staff at my institution (if so, I extract the relevant information about authors, journal, accession number and voucher specimen).

This is my code:

from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# first I try to find the records I am interested in. Example:
query = "Helianthus[Organism]" 
handle = Entrez.esearch(db="nuccore", retmax=1000, term=query, idtype="acc")
record = Entrez.read(handle)
idsList = record["IdList"]
handle.close()

# now I want to retrieve information about each record (but NOT sequences):
handle = Entrez.efetch(db="nuccore", id=idsList, rettype="gb", retmode="text")

# and then I would parse this handle to check the info I am interested in

This works, but some records include very long sequences (full genomes, I guess), so the downloaded file is huge.

Is there any way to exclude the sequence data from the download?

Or perhaps there is a way to get the size of each accession along with my Entrez.esearch query, so I can remove the oversized records from my idsList above?
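One possible way to get record sizes without downloading any sequence is Entrez.esummary, which returns short document summaries. The sketch below is only an illustration under assumptions: the helper names fetch_docsums and ids_below_length are hypothetical, and it assumes the nuccore summaries expose "Length" and "AccessionVersion" keys, which is worth confirming by printing one summary first.

```python
def fetch_docsums(accessions, email):
    """Fetch nuccore document summaries (metadata only) for the given IDs."""
    # Imported lazily so the filtering helper below also works without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esummary(db="nuccore", id=",".join(accessions))
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def ids_below_length(docsums, max_length):
    """Keep the identifiers of summaries whose sequence length is <= max_length.

    Assumption: each summary behaves like a dict with a "Length" key and,
    usually, an "AccessionVersion" key (falling back to "Id" otherwise).
    """
    kept = []
    for summary in docsums:
        if int(summary["Length"]) <= max_length:
            kept.append(summary.get("AccessionVersion", summary["Id"]))
    return kept
```

With the idsList from the search above, something like ids_below_length(fetch_docsums(idsList, Entrez.email), 100000) would give a shorter list to pass to Entrez.efetch; the 100000 threshold is arbitrary and only for illustration.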

For example, given this record, this is the only information I want to read.

I am particularly interested in the /specimen_voucher="SF193" line near the end of this text:

LOCUS       MNCJ02000332       195042445 bp    DNA     linear   PLN 13-JUL-2020
DEFINITION  Helianthus annuus cultivar XRQ/B chromosome 17, whole genome
            shotgun sequence.
ACCESSION   MNCJ02000332 MNCJ02000000
VERSION     MNCJ02000332.1
DBLINK      BioProject: PRJNA345532
            BioSample: SAMN05868438
KEYWORDS    WGS.
SOURCE      Helianthus annuus (common sunflower)
  ORGANISM  Helianthus annuus
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
            Pentapetalae; asterids; campanulids; Asterales; Asteraceae;
            Asteroideae; Heliantheae alliance; Heliantheae; Helianthus.
REFERENCE   1  (bases 1 to 195042445)
  AUTHORS   Badouin,H., Gouzy,J., Grassa,C.J., Murat,F., Staton,S.E.,
            Cottret,L., Lelandais-Briere,C., Owens,G.L., Carrere,S.,
            Mayjonade,B., Legrand,L., Gill,N., Kane,N.C., Bowers,J.E.,
            Hubner,S., Bellec,A., Berard,A., Berges,H., Blanchet,N.,
            Boniface,M.C., Brunel,D., Catrice,O., Chaidir,N., Claudel,C.,
            Donnadieu,C., Faraut,T., Fievet,G., Helmstetter,N., King,M.,
            Knapp,S.J., Lai,Z., Le Paslier,M.C., Lippi,Y., Lorenzon,L.,
            Mandel,J.R., Marage,G., Marchand,G., Marquand,E., Bret-Mestries,E.,
            Morien,E., Nambeesan,S., Nguyen,T., Pegot-Espagnet,P., Pouilly,N.,
            Raftis,F., Sallet,E., Schiex,T., Thomas,J., Vandecasteele,C.,
            Vares,D., Vear,F., Vautrin,S., Crespi,M., Mangin,B., Burke,J.M.,
            Salse,J., Munos,S., Vincourt,P., Rieseberg,L.H. and Langlade,N.B.
  TITLE     The sunflower genome provides insights into oil metabolism,
            flowering and Asterid evolution
  JOURNAL   Nature 546 (7656), 148-152 (2017)
   PUBMED   28538728
REFERENCE   2  (bases 1 to 195042445)
  AUTHORS   Gouzy,J., Langlade,N. and Munos,S.
  TITLE     Helianthus annuus Genome sequencing and assembly Release 2
  JOURNAL   Unpublished
REFERENCE   3  (bases 1 to 195042445)
  AUTHORS   Langlade,N. and Munos,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (27-FEB-2017) Laboratoire des Interactions Plantes
            Micro-organismes, INRA/CNRS, Chemin de Borderouge, Castanet-Tolosan
            31200, France
REFERENCE   4  (bases 1 to 195042445)
  AUTHORS   Gouzy,J., Langlade,N. and Munos,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (08-JUN-2020) Laboratoire des Interactions Plantes
            Micro-organismes, INRAE/CNRS, Chemin de Borderouge,
            Castanet-Tolosan 31320, France
COMMENT     ##Genome-Assembly-Data-START##
            Assembly Date          :: 17-AUG-2018
            Assembly Method        :: CANU v. 1.3; CANU v. 1.4; FALCON v. 0.7;
                                      Bionano-Solve v. 3.2.1_04122018
            Assembly Name          :: HanXRQr2.0-SUNRISE
            Genome Representation  :: Full
            Expected Final Version :: No
            Genome Coverage        :: 100.0x
            Sequencing Technology  :: PacBio RSII
            ##Genome-Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..195042445
                     /organism="Helianthus annuus"
                     /mol_type="genomic DNA"
                     /submitter_seqid="HanXRQChr17"
                     /cultivar="XRQ/B"
                     /specimen_voucher="SF193"
                     /db_xref="taxon:4232"
                     /chromosome="17"
                     /tissue_type="leaves"
                     /dev_stage="4 leaves"
                     /country="France"
                     /collected_by="INRA, LIPM"
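For what it is worth, pulling these fields out of the GenBank text does not need the sequence part at all. Here is a minimal sketch using only the standard library; the function names extract_qualifier and extract_journals are made up for illustration, and it assumes the usual flat-file layout where keyword data starts at column 13, as in the record above.

```python
import re


def extract_qualifier(genbank_text, name):
    """Return all values of a /name="value" feature qualifier, in order."""
    pattern = re.compile(
        r'^\s+/' + re.escape(name) + r'="([^"]*)"', re.MULTILINE
    )
    return pattern.findall(genbank_text)


def extract_journals(genbank_text):
    """Return one string per JOURNAL entry, joining continuation lines."""
    journals = []
    current = None  # parts of the JOURNAL entry being read, if any
    for line in genbank_text.splitlines():
        if line.startswith("  JOURNAL"):
            # Data starts at column 13 in the flat-file layout.
            current = [line[12:].strip()]
            journals.append(current)
        elif current is not None and line.startswith(" " * 12):
            # Continuation line of the current JOURNAL entry.
            current.append(line.strip())
        else:
            current = None
    return [" ".join(parts) for parts in journals]
```

For the record above, extract_qualifier(text, "specimen_voucher") would pick up the SF193 value, and the same helper works for other qualifiers such as country or collected_by.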

Below that, there is a lot more information (gene, mRNA, CDS ...), up to 11 MB, which I do NOT need.

Is there any way to avoid downloading all of that?

Or at least, is there a way to skip records bigger than a given size?

3.0 years ago
vkkodali_ncbi ★ 3.8k

The specimen_voucher information comes from BioSample. You don't have to download the entire nucleotide record to get it.

If you have a bunch of nucleotide identifiers, you can use elink to get the biosample data in XML format and then parse it to extract the specimen information.
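A rough sketch of the elink step with Biopython might look like the following. Treat it as an illustration only: the helper names are hypothetical, the IDs are assumed to be nucleotide UIDs, and the key names ("LinkSetDb", "DbTo", "Link", "IdList") are what I would expect in the parsed elink result, so check them against an actual response. Fetching and parsing the BioSample XML itself is left out.

```python
def fetch_biosample_linksets(nuccore_ids, email):
    """Ask elink for BioSample records linked to the given nucleotide IDs."""
    # Imported lazily so the parsing helper below also works without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.elink(dbfrom="nuccore", db="biosample", id=nuccore_ids)
    try:
        return Entrez.read(handle)
    finally:
        handle.close()


def linked_ids(linksets, target_db="biosample"):
    """Map each source ID to the IDs it links to in target_db.

    Caveat: if several source IDs share one link set, the linked IDs
    cannot be told apart and are attributed to every source ID in it.
    """
    mapping = {}
    for linkset in linksets:
        targets = [
            link["Id"]
            for linksetdb in linkset.get("LinkSetDb", [])
            if linksetdb.get("DbTo") == target_db
            for link in linksetdb["Link"]
        ]
        for source_id in linkset.get("IdList", []):
            mapping.setdefault(source_id, []).extend(targets)
    return mapping
```

Nucleotide records with no BioSample link simply end up with an empty list, which makes it easy to see which records would still need another route.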


Thanks @vkkodali

I didn't know about the BioSample database, but the link you posted contains much less information than the Nucleotide database (I also need to check publication authors, title and journal).

Also, the example nucleotide accession link I posted contains this linked information, which is where you found that link (I guess):

DBLINK      BioProject: PRJNA345532
            BioSample: SAMN05868438

But in fact, many other Nucleotide records (KP941566, AJ632189, MT943645) don't contain any links to those databases.

So I guess BioSample wouldn't help in most of my cases. I could give it a try, but I don't know how to get publication information from the BioSample database. Could you post some example code?

Thanks a lot in advance
