Question

Parse JSON ncbi_datasets summary output


3.4 years ago

katieostrouchov ▴ 30

How do we obtain the tax_id, sci_name, and assembly_accession from a summary json file using jq?

I want to collect this information before downloading the associated reference proteomes for multiple genera with the conda ncbi_datasets package.

I have been unsuccessful with the following code.

Example code using jq:

datasets summary genome taxon "Bacteroides" --reference > Bacteroides.json

cat Bacteroides.json | \
jq -r '.assemblies[].assembly[.assembly_accession] | \
.assemblies[].assembly[.tax_id] | \
.assemblies[].assembly[.sci_name] | \
@text' > test.txt

Example code using grep:

grep -C 1 tax_id Bacteroides.json > test.txt

Below is the first entry from Bacteroides.json, pretty-printed with jq '.':

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "file": [
            {
              "estimated_size": "316141",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "3839359",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "984979",
              "type": "PROT_FASTA"
            },
            {
              "estimated_size": "357603",
              "type": "GENOME_GTF"
            },
            {
              "estimated_size": "1636624",
              "type": "CDS_FASTA"
            }
          ],
          "name": "From INSDC submitter",
          "release_date": "07/08/2021",
          "source": "NCBI RefSeq",
          "stats": {
            "gene_counts": {
              "protein_coding": 4150,
              "total": 4465
            }
          }
        },
        "assembly_accession": "GCF_910575425.1",
        "assembly_category": "representative genome",
        "assembly_level": "Scaffold",
        "bioproject_lineages": [
          {
            "bioprojects": [
              {
                "accession": "PRJEB45232",
                "title": "Genome assemblies for the Mouse Culture Collection (MGBC)"
              }
            ]
          }
        ],
        "biosample_accession": "SAMEA8801361",
        "chromosomes": [
          {
            "gc_count": "2261981",
            "length": "5258315",
            "name": "Un"
          }
        ],
        "contig_n50": 109360,
        "display_name": "MGBC000127",
        "estimated_size": "8705042",
        "gc_count": "2261981",
        "org": {
          "assembly_counts": {
            "node": 33,
            "subtree": 33
          },
          "isolate": "B76_1_BA_BP",
          "key": "85831",
          "parent_tax_id": "816",
          "rank": "SPECIES",
          "sci_name": "Bacteroides acidifaciens",
          "tax_id": "85831",
          "title": "Bacteroides acidifaciens"
        },
        "paired_assembly_accession": "GCA_910575425.1",
        "seq_length": "5258315",
        "submission_date": "2021-07-07",
        "submitter": "UNIVERSITY OF CAMBRIDGE"
      }
    }

parse json ncbi_datasets jq • 2.4k views


score 4 · Accepted Answer · 2022-01-11


3.4 years ago

vkkodali_ncbi ★ 3.8k

You were almost there! Try the following to get a tab-delimited output:

cat Bacteroides.json | jq -r '.assemblies[].assembly|[.assembly_accession,.org.tax_id,.org.sci_name]|@tsv'
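Judging from the sample entry in the question, the original filter likely failed for two reasons: square brackets after .assembly are index syntax rather than field selection, and tax_id and sci_name sit inside the nested org object rather than directly under assembly. The sketch below illustrates the working pattern on a small stand-in file and adds a header row. The file names sample_summary.json and sample_summary.tsv are made up for illustration, and the JSON is trimmed from the entry shown in the question.

```shell
# Hypothetical stand-in file, trimmed from the entry shown in the question.
cat > sample_summary.json <<'EOF'
{
  "assemblies": [
    {
      "assembly": {
        "assembly_accession": "GCF_910575425.1",
        "org": {
          "sci_name": "Bacteroides acidifaciens",
          "tax_id": "85831"
        }
      }
    }
  ]
}
EOF

# Emit a header row, then one row per assembly, all formatted as TSV.
# tax_id and sci_name are read from the nested "org" object.
jq -r '
  ["assembly_accession", "tax_id", "sci_name"],
  (.assemblies[].assembly | [.assembly_accession, .org.tax_id, .org.sci_name])
  | @tsv
' sample_summary.json > sample_summary.tsv
```

Passing the file name to jq directly also avoids the extra cat; the same filter should work unchanged on the full Bacteroides.json.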

An alternative to jq is dataformat. However, at this time dataformat cannot process the JSON emitted by the datasets summary command; it can only process the JSON Lines report files produced by the datasets download command.

To use dataformat in this situation:

## first download a dehydrated package as we are interested in only the data-report
datasets download genome taxon "Bacteroides" --reference --dehydrated
## then use dataformat to extract fields of interest
dataformat tsv genome --package ncbi_dataset.zip --fields assminfo-accession,tax-id,organism-name



Isn't dataformat supposed to be used for parsing the JSON file? What would the comparable command be there?

Reply • 3.4 years ago by GenoMax 151k


dataformat excel genome --inputfile Bacteroides/ncbi_dataset/data/assembly_data_report.jsonl --fields organism-name,tax-id,assminfo-refseq-assm-accession -o ./lists/Bacteriodes.xlsx

The input file needs to be .jsonl, or dataformat will not be able to parse it.

Reply • 3.4 years ago by katieostrouchov ▴ 30


The jq command worked! Thank you for your prompt response.

Working only from the JSON file produced by datasets summary cut the directory size roughly in half compared with using datasets download to fetch the dehydrated package (protein.faa files) for Bacteroides, which has roughly 50 reference assemblies. Although the dehydrated package is slightly larger, I like that it is easy to rehydrate to obtain the FASTA files after examining the potential contents with dataformat.
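For anyone following along, the rehydration step could be sketched roughly as below. This assumes the dehydrated package was saved under the default name ncbi_dataset.zip and that unzip and the datasets CLI are on the PATH; the function name rehydrate_package and the target directory are made up for illustration.

```shell
# Hypothetical helper: unpack a dehydrated package, then fetch its data files.
# Usage: rehydrate_package ncbi_dataset.zip Bacteroides
rehydrate_package() {
  zip_file="$1"
  out_dir="$2"
  # Extract the package (data report plus the list of files to fetch).
  unzip -q "$zip_file" -d "$out_dir" || return 1
  # Download the sequence files referenced by the extracted package.
  datasets rehydrate --directory "$out_dir"
}
```

After a call such as rehydrate_package ncbi_dataset.zip Bacteroides, the report should be at Bacteroides/ncbi_dataset/data/assembly_data_report.jsonl, which matches the path used in the dataformat command earlier in the thread.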