Parse JSON ncbi_datasets summary output
2.9 years ago

How do we obtain the tax_id, sci_name, and assembly_accession from a summary json file using jq?

I want to collect this information before I download the associated reference proteomes for multiple genera using the conda ncbi_datasets package.

I have been unsuccessful with the following code.


Example code using jq:

datasets summary genome taxon "Bacteroides" --reference > Bacteroides.json

cat Bacteroides.json | \
jq -r '.assemblies[].assembly[.assembly_accession] | \
.assemblies[].assembly[.tax_id] | \
.assemblies[].assembly[.sci_name] | \
@text' > test.txt

Example code using grep:

grep -C 1 tax_id Bacteroides.json > test.txt

Below is the first entry from Bacteroides.json, pretty-printed with jq . (output truncated after the first assembly):

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "file": [
            {
              "estimated_size": "316141",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "3839359",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "984979",
              "type": "PROT_FASTA"
            },
            {
              "estimated_size": "357603",
              "type": "GENOME_GTF"
            },
            {
              "estimated_size": "1636624",
              "type": "CDS_FASTA"
            }
          ],
          "name": "From INSDC submitter",
          "release_date": "07/08/2021",
          "source": "NCBI RefSeq",
          "stats": {
            "gene_counts": {
              "protein_coding": 4150,
              "total": 4465
            }
          }
        },
        "assembly_accession": "GCF_910575425.1",
        "assembly_category": "representative genome",
        "assembly_level": "Scaffold",
        "bioproject_lineages": [
          {
            "bioprojects": [
              {
                "accession": "PRJEB45232",
                "title": "Genome assemblies for the Mouse Culture Collection (MGBC)"
              }
            ]
          }
        ],
        "biosample_accession": "SAMEA8801361",
        "chromosomes": [
          {
            "gc_count": "2261981",
            "length": "5258315",
            "name": "Un"
          }
        ],
        "contig_n50": 109360,
        "display_name": "MGBC000127",
        "estimated_size": "8705042",
        "gc_count": "2261981",
        "org": {
          "assembly_counts": {
            "node": 33,
            "subtree": 33
          },
          "isolate": "B76_1_BA_BP",
          "key": "85831",
          "parent_tax_id": "816",
          "rank": "SPECIES",
          "sci_name": "Bacteroides acidifaciens",
          "tax_id": "85831",
          "title": "Bacteroides acidifaciens"
        },
        "paired_assembly_accession": "GCA_910575425.1",
        "seq_length": "5258315",
        "submission_date": "2021-07-07",
        "submitter": "UNIVERSITY OF CAMBRIDGE"
      }
    }
parse json ncbi_datasets jq • 2.0k views
2.9 years ago
vkkodali_ncbi ★ 3.8k

You were almost there! Try the following to get a tab-delimited output:

cat Bacteroides.json | jq -r '.assemblies[].assembly|[.assembly_accession,.org.tax_id,.org.sci_name]|@tsv'
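
For anyone who would rather do the same extraction in Python than in jq, here is a minimal sketch. It assumes the summary JSON has the nesting shown in the question: assembly_accession sits directly under "assembly", while tax_id and sci_name sit one level deeper under "org" (which is also why the original filter did not find them). The SAMPLE string is a trimmed copy of the first entry from the question, and the helper names summary_rows and to_tsv are made up for this example.

```python
import json

# Trimmed-down sample with the same nesting as the summary JSON in the question.
SAMPLE = """
{
  "assemblies": [
    {
      "assembly": {
        "assembly_accession": "GCF_910575425.1",
        "org": {
          "sci_name": "Bacteroides acidifaciens",
          "tax_id": "85831"
        }
      }
    }
  ]
}
"""


def summary_rows(summary):
    """Yield (assembly_accession, tax_id, sci_name) for each assembly entry."""
    for entry in summary.get("assemblies", []):
        assembly = entry.get("assembly", {})
        org = assembly.get("org", {})  # tax_id and sci_name live under "org"
        yield (
            str(assembly.get("assembly_accession", "")),
            str(org.get("tax_id", "")),
            str(org.get("sci_name", "")),
        )


def to_tsv(summary):
    """Join each row with tabs and the rows with newlines, like jq's @tsv."""
    return "\n".join("\t".join(row) for row in summary_rows(summary))


tsv = to_tsv(json.loads(SAMPLE))
print(tsv)
```

To run it on the real file, replace json.loads(SAMPLE) with json.load(fh) inside a with open("Bacteroides.json") as fh: block.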

An alternative to jq is to use dataformat. However, at this time dataformat cannot process the JSON emitted by the datasets summary command; it can only process the JSON Lines report files produced by the datasets download command.

To use dataformat in this situation:

## first download a dehydrated package as we are interested in only the data-report
datasets download genome taxon "Bacteroides" --reference --dehydrated
## then use dataformat to extract fields of interest
dataformat tsv genome --package ncbi_dataset.zip --fields assminfo-accession,tax-id,organism-name

Isn't dataformat supposed to be used for parsing the JSON file? What would the comparable command be there?


dataformat excel genome --inputfile Bacteroides/ncbi_dataset/data/assembly_data_report.jsonl --fields organism-name,tax-id,assminfo-refseq-assm-accession -o ./lists/Bacteriodes.xlsx

The input file needs to be .jsonl, or dataformat will not be able to parse it.


The jq text worked! Thank you for your prompt response.

Checking only the JSON file from datasets summary cut the directory size roughly in half compared with using datasets download for the dehydrated protein.faa files for Bacteroides, which has roughly 50 references. Although the dehydrated package is somewhat larger, I do like that it is easy to rehydrate to obtain the FASTA files after examining the potential contents with dataformat.


Please accept the answer to provide closure to this thread (green checkmark).
