I am trying to download information from the NCBI Entrez databases (nucleotide) using the Biopython package.
I don't need molecular data at all. I just want to check the textual information about certain records: references, authors, journals, and information about the voucher specimens from which the genome sample was extracted.
My query returns a lot of records, and I need to check whether they are related to staff at my institution (if so, I extract the relevant information about authors, journal, accession number and voucher specimen).
This is my code:
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
# first I try to find the records I am interested in. Example:
query = "Helianthus[Organism]"
handle = Entrez.esearch(db="nuccore", retmax=1000, term=query, idtype="acc")
record = Entrez.read(handle)
idsList = record["IdList"]
handle.close()
# now I want to retrieve information about each record (but NOT sequences):
handle = Entrez.efetch(db="nuccore", id=idsList, rettype="gb", retmode="text")
# and then I would parse this handle to check the info I am interested in
No problem doing this. But I found that some records include very long sequences (full genomes, I guess), so the downloaded file would be huge.
Is there any way to exclude the genome data from the download?
Or is there perhaps a way to get the size of each accession from my Entrez.esearch query, so I can remove those records from my idsList above?
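A rough sketch of the size-filtering idea, under stated assumptions. The Nucleotide search syntax has a sequence-length field (`[SLEN]`) that accepts a range, so the limit can go straight into the esearch term. Alternatively, Entrez.esummary returns a short summary per record, and I am assuming here that each nuccore summary carries `Length` and `AccessionVersion` fields. The helper names and the 100000 bp cutoff are made up for illustration.

```python
def with_length_limit(term, max_bp):
    """Restrict an Entrez search term to sequences of at most max_bp bases."""
    return f"({term}) AND 1:{max_bp}[SLEN]"


def small_enough(summaries, max_bp):
    """Keep the accessions whose summary reports a length of at most max_bp.

    Assumes each summary is dict-like with 'Length' and 'AccessionVersion'
    keys, as nuccore esummary results appear to be.
    """
    return [s["AccessionVersion"] for s in summaries if int(s["Length"]) <= max_bp]


def search_small_records(term, max_bp, retmax=1000):
    """Run the length-limited esearch, then double-check sizes via esummary."""
    # Imported here so the two helpers above work without Biopython installed.
    from Bio import Entrez

    Entrez.email = "Your.Name.Here@example.org"
    handle = Entrez.esearch(db="nuccore", retmax=retmax,
                            term=with_length_limit(term, max_bp), idtype="acc")
    record = Entrez.read(handle)
    handle.close()
    if not record["IdList"]:
        return []
    handle = Entrez.esummary(db="nuccore", id=",".join(record["IdList"]))
    summaries = Entrez.read(handle)
    handle.close()
    return small_enough(summaries, max_bp)
```

Calling `search_small_records("Helianthus[Organism]", 100000)` would then give a list of accessions to pass to Entrez.efetch. Note that this drops the large records entirely, so their voucher information is lost as well.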
For example, given this record, this is the only information I want to read.
I am particularly interested in the /specimen_voucher="SF193" line near the end of this text:
LOCUS MNCJ02000332 195042445 bp DNA linear PLN 13-JUL-2020
DEFINITION Helianthus annuus cultivar XRQ/B chromosome 17, whole genome
shotgun sequence.
ACCESSION MNCJ02000332 MNCJ02000000
VERSION MNCJ02000332.1
DBLINK BioProject: PRJNA345532
BioSample: SAMN05868438
KEYWORDS WGS.
SOURCE Helianthus annuus (common sunflower)
ORGANISM Helianthus annuus
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
Spermatophyta; Magnoliopsida; eudicotyledons; Gunneridae;
Pentapetalae; asterids; campanulids; Asterales; Asteraceae;
Asteroideae; Heliantheae alliance; Heliantheae; Helianthus.
REFERENCE 1 (bases 1 to 195042445)
AUTHORS Badouin,H., Gouzy,J., Grassa,C.J., Murat,F., Staton,S.E.,
Cottret,L., Lelandais-Briere,C., Owens,G.L., Carrere,S.,
Mayjonade,B., Legrand,L., Gill,N., Kane,N.C., Bowers,J.E.,
Hubner,S., Bellec,A., Berard,A., Berges,H., Blanchet,N.,
Boniface,M.C., Brunel,D., Catrice,O., Chaidir,N., Claudel,C.,
Donnadieu,C., Faraut,T., Fievet,G., Helmstetter,N., King,M.,
Knapp,S.J., Lai,Z., Le Paslier,M.C., Lippi,Y., Lorenzon,L.,
Mandel,J.R., Marage,G., Marchand,G., Marquand,E., Bret-Mestries,E.,
Morien,E., Nambeesan,S., Nguyen,T., Pegot-Espagnet,P., Pouilly,N.,
Raftis,F., Sallet,E., Schiex,T., Thomas,J., Vandecasteele,C.,
Vares,D., Vear,F., Vautrin,S., Crespi,M., Mangin,B., Burke,J.M.,
Salse,J., Munos,S., Vincourt,P., Rieseberg,L.H. and Langlade,N.B.
TITLE The sunflower genome provides insights into oil metabolism,
flowering and Asterid evolution
JOURNAL Nature 546 (7656), 148-152 (2017)
PUBMED 28538728
REFERENCE 2 (bases 1 to 195042445)
AUTHORS Gouzy,J., Langlade,N. and Munos,S.
TITLE Helianthus annuus Genome sequencing and assembly Release 2
JOURNAL Unpublished
REFERENCE 3 (bases 1 to 195042445)
AUTHORS Langlade,N. and Munos,S.
TITLE Direct Submission
JOURNAL Submitted (27-FEB-2017) Laboratoire des Interactions Plantes
Micro-organismes, INRA/CNRS, Chemin de Borderouge, Castanet-Tolosan
31200, France
REFERENCE 4 (bases 1 to 195042445)
AUTHORS Gouzy,J., Langlade,N. and Munos,S.
TITLE Direct Submission
JOURNAL Submitted (08-JUN-2020) Laboratoire des Interactions Plantes
Micro-organismes, INRAE/CNRS, Chemin de Borderouge,
Castanet-Tolosan 31320, France
COMMENT ##Genome-Assembly-Data-START##
Assembly Date :: 17-AUG-2018
Assembly Method :: CANU v. 1.3; CANU v. 1.4; FALCON v. 0.7;
Bionano-Solve v. 3.2.1_04122018
Assembly Name :: HanXRQr2.0-SUNRISE
Genome Representation :: Full
Expected Final Version :: No
Genome Coverage :: 100.0x
Sequencing Technology :: PacBio RSII
##Genome-Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..195042445
/organism="Helianthus annuus"
/mol_type="genomic DNA"
/submitter_seqid="HanXRQChr17"
/cultivar="XRQ/B"
/specimen_voucher="SF193"
/db_xref="taxon:4232"
/chromosome="17"
/tissue_type="leaves"
/dev_stage="4 leaves"
/country="France"
/collected_by="INRA, LIPM"
Below that, there is a lot more information (gene, mRNA, CDS ...), up to 11 MB, which I do NOT need.
Is there any way to avoid downloading all of that?
Or at least, is there a way to skip records bigger than a given size?
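One thing that might work, as a sketch only: fetch one accession at a time, read the efetch handle line by line, and stop as soon as the first feature after `source` appears. Closing the handle early may reduce how much is transferred, although buffering means that is not guaranteed, and it costs one request per record. The function names below are hypothetical, and the parsing assumes the standard GenBank flat-file layout (feature keys indented by 5 spaces, qualifiers by 21).

```python
import re

# A feature key line starts with exactly 5 spaces followed by the key name.
FEATURE_KEY = re.compile(r"^ {5}(\S+)")
VOUCHER = re.compile(r'/specimen_voucher="([^"]*)"')


def header_lines(lines):
    """Yield the lines of one GenBank record up to the end of the 'source' feature."""
    in_features = False
    for line in lines:
        if line.startswith("FEATURES"):
            in_features = True
        elif in_features:
            match = FEATURE_KEY.match(line)
            if match and match.group(1) != "source":
                return  # first gene/mRNA/CDS feature: stop reading
            if line.startswith(("ORIGIN", "CONTIG", "//")):
                return  # sequence section or end of record
        yield line.rstrip("\n")


def specimen_voucher(lines):
    """Return the first specimen_voucher value found, or None."""
    for line in lines:
        found = VOUCHER.search(line)
        if found:
            return found.group(1)
    return None


def fetch_header(accession):
    """Fetch a single record and keep only its header and source feature."""
    # Imported here so the parsing helpers work without Biopython installed.
    from Bio import Entrez

    Entrez.email = "Your.Name.Here@example.org"
    handle = Entrez.efetch(db="nuccore", id=accession, rettype="gb", retmode="text")
    try:
        return list(header_lines(handle))
    finally:
        handle.close()
```

With this, `specimen_voucher(fetch_header("MNCJ02000332"))` should return the voucher for the record above, and the REFERENCE, AUTHORS and JOURNAL lines stay available in the returned list.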
Thanks @vkkodali
I didn't know about the BioSample database, but the link you posted contains much less information than the Nucleotide database (I also need to check publication authors, title and journal).
Also, the example nucleotide accession link I posted contains this linked information, which is where you found that link (I guess):
But in fact, many other Nucleotide records (KP941566, AJ632189, MT943645) don't contain any links to those databases.
So I guess BioSample wouldn't help in most of my cases. I could give it a try, but I don't know how to get publication information from the BioSample database. Could you post some example code?
Thanks a lot in advance