How do we obtain the tax_id, sci_name, and assembly_accession from a summary json file using jq?
I want to collect this information before downloading the associated reference proteomes for multiple genera with the conda ncbi_datasets package.
I have been unsuccessful with the following code.
Example code using jq:
datasets summary genome taxon "Bacteroides" --reference > Bacteroides.json
cat Bacteroides.json | \
jq -r '.assemblies[].assembly[.assembly_accession] | \
.assemblies[].assembly[.tax_id] | \
.assemblies[].assembly[.sci_name] | \
@text' > test.txt
Example code using grep:
grep -C 1 tax_id Bacteroides.json > test.txt
Below is the first entry from Bacteroides.json, formatted with jq . :
{
"assemblies": [
{
"assembly": {
"annotation_metadata": {
"file": [
{
"estimated_size": "316141",
"type": "GENOME_GFF"
},
{
"estimated_size": "3839359",
"type": "GENOME_GBFF"
},
{
"estimated_size": "984979",
"type": "PROT_FASTA"
},
{
"estimated_size": "357603",
"type": "GENOME_GTF"
},
{
"estimated_size": "1636624",
"type": "CDS_FASTA"
}
],
"name": "From INSDC submitter",
"release_date": "07/08/2021",
"source": "NCBI RefSeq",
"stats": {
"gene_counts": {
"protein_coding": 4150,
"total": 4465
}
}
},
"assembly_accession": "GCF_910575425.1",
"assembly_category": "representative genome",
"assembly_level": "Scaffold",
"bioproject_lineages": [
{
"bioprojects": [
{
"accession": "PRJEB45232",
"title": "Genome assemblies for the Mouse Culture Collection (MGBC)"
}
]
}
],
"biosample_accession": "SAMEA8801361",
"chromosomes": [
{
"gc_count": "2261981",
"length": "5258315",
"name": "Un"
}
],
"contig_n50": 109360,
"display_name": "MGBC000127",
"estimated_size": "8705042",
"gc_count": "2261981",
"org": {
"assembly_counts": {
"node": 33,
"subtree": 33
},
"isolate": "B76_1_BA_BP",
"key": "85831",
"parent_tax_id": "816",
"rank": "SPECIES",
"sci_name": "Bacteroides acidifaciens",
"tax_id": "85831",
"title": "Bacteroides acidifaciens"
},
"paired_assembly_accession": "GCA_910575425.1",
"seq_length": "5258315",
"submission_date": "2021-07-07",
"submitter": "UNIVERSITY OF CAMBRIDGE"
}
}
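Going by the entry shown above, assembly_accession sits directly under each "assembly" object, while tax_id and sci_name are nested one level deeper under "org". The filter in the question uses bracket indexing (.assembly[.tax_id]) where plain field access is needed, and jq programs do not take backslash line continuations inside the quotes. A minimal sketch of one way to pull the three fields, assuming the rest of the file follows the same layout as this first entry (sample_summary.json is a made-up stand-in so the snippet runs on its own; point jq at Bacteroides.json instead):

```shell
# Hypothetical stand-in for Bacteroides.json with the same nesting as the
# entry above (values copied from that entry; the real file has more fields).
cat > sample_summary.json <<'EOF'
{
  "assemblies": [
    {
      "assembly": {
        "assembly_accession": "GCF_910575425.1",
        "org": {
          "sci_name": "Bacteroides acidifaciens",
          "tax_id": "85831"
        }
      }
    }
  ]
}
EOF

# One tab-separated row per assembly: tax_id, sci_name, assembly_accession.
# tax_id and sci_name live under .org; the accession is on .assembly itself.
jq -r '.assemblies[].assembly
       | [.org.tax_id, .org.sci_name, .assembly_accession]
       | @tsv' sample_summary.json > test.txt

cat test.txt
```

Swapping @tsv for @csv should give comma-separated output with quoted strings, if that is easier to load elsewhere.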
Isn't dataformat supposed to be used for parsing the JSON file? What would the comparable command be there?
dataformat excel genome --inputfile Bacteroides/ncbi_dataset/data/assembly_data_report.jsonl --fields organism-name,tax-id,assminfo-refseq-assm-accession -o ./lists/Bacteriodes.xlsx
The input file needs to be in .jsonl format, or dataformat will not be able to parse it.
The jq text worked! Thank you for your prompt response.
Fetching only the JSON file with datasets summary cut the directory size roughly in half compared with using datasets download for the dehydrated protein.faa files for Bacteroides, which has roughly 50 references. Although the dehydrated download is slightly larger, I like that it is easy to rehydrate to obtain the FASTA files after examining the potential contents with dataformat.
Please accept the answer to provide closure to this thread (green checkmark).