Question

Extracting and visualizing specific genomic reference structures

4 days ago

s_135 ▴ 10

Steps followed: I have constructed a pangenome. I have the graphs in GBZ and XG formats and have used them downstream. For visualization, I used odgi for 1D visualization as follows:

Step 1: convert to GFA for odgi to use

vg view -g pangenome_graph.giraffe.gbz > pangenome_graph.vis.gfa

or

vg convert -f pangenome_graph.giraffe.gbz > pangenome_graph.vis.gfa

or

vg convert pangenome_graph.giraffe.gbz \
    --rgfa-path Pseudomolecule_01 \
    --rgfa-path Pseudomolecule_02 \
    --rgfa-path Pseudomolecule_03 \
    --rgfa-path Pseudomolecule_04 \
    --rgfa-path Pseudomolecule_05 \
    --rgfa-path Pseudomolecule_06 \
    --rgfa-path Pseudomolecule_07 \
    --rgfa-path Pseudomolecule_08 \
    --rgfa-path Pseudomolecule_09 \
    --rgfa-path Pseudomolecule_10 \
    --rgfa-path Pseudomolecule_11 \
    --rgfa-path Pseudomolecule_12 \
    --rgfa-path Pseudomolecule_13 \
    --rgfa-path Pseudomolecule_14 \
    --rgfa-path Pseudomolecule_15 \
    --rgfa-path Pseudomolecule_16 \
    --rgfa-path Pseudomolecule_17 \
    --rgfa-path Pseudomolecule_18 \
    --rgfa-path Pseudomolecule_19 \
    -f > pangenome_graph_filtered.gfa

Step 2: convert to odgi's .og format

odgi build -g /data/pangenome_graph_filtered.gfa -o /data/pangenome_graph_filtered.vis.og

Step 3: visualize

odgi viz -i /data/pangenome_graph_filtered.vis.og -o /data/pangenome_graph_filtered.vis.png

Outputs: The first command in Step 1 gave a graph with both Pseudomolecules and Scaffolds, with a smaller file size. The second command gave a similar graph, just with a larger file size.

The third command removed all the Pseudomolecules and retained only Scaffolds.
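A possible explanation, going by the vg convert help quoted further down: --rgfa-path is described there as writing the given path "as rGFA tags instead of lines". If so, the Pseudomolecule paths may still be in the GFA as tags on the segment lines rather than as path lines, and odgi viz would then have no rows to draw for them. A minimal check, assuming the standard rGFA tag name SN:Z: for the stable sequence name; the toy file content is made up and only stands in for pangenome_graph_filtered.gfa:

```shell
# Toy rGFA-style file (hypothetical content) standing in for the real GFA.
gfa=$(mktemp)
printf 'H\tVN:Z:1.1\n' > "$gfa"
printf 'S\t1\tACGT\tSN:Z:Pseudomolecule_01\tSO:i:0\tSR:i:0\n' >> "$gfa"
printf 'S\t2\tGGA\tSN:Z:Pseudomolecule_01\tSO:i:4\tSR:i:0\n' >> "$gfa"
printf 'S\t3\tTT\n' >> "$gfa"
printf 'L\t1\t+\t2\t+\t0M\n' >> "$gfa"
printf 'P\tScaffolds_000\t3+\t*\n' >> "$gfa"

# Count segments per stable name stored in rGFA SN tags on S-lines.
tagged=$(awk -F'\t' '$1 == "S" { for (i = 4; i <= NF; i++) if ($i ~ /^SN:Z:/) print substr($i, 6) }' "$gfa" \
    | sort | uniq -c | awk '{ print $2 "=" $1 }')

# Names that exist as path records (P-lines: column 2, W-lines: column 4).
plines=$(awk -F'\t' '$1 == "P" { print $2 } $1 == "W" { print $4 }' "$gfa" | sort -u)

echo "rGFA-tagged: $tagged"
echo "path lines:  $plines"
```

If the real file shows the Pseudomolecule names only in the first list, the third command converted them to tags rather than deleting them.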

Expected output: Retain only Pseudomolecule_01 through Pseudomolecule_19 and remove all the Scaffolds (Scaffolds_000 through 917). I am trying to create output like the ones here: https://zenodo.org/records/8288999
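One way this could be approached at the GFA level, sketched under assumptions: drop every path record whose name does not start with Pseudomolecule_ before running odgi build. This assumes the paths are written as P-lines (name in column 2; the vg convert help below lists -W, --no-wline for that) or as W-lines (contig name in column 4). It removes only the path records, so the segments covered by the scaffolds stay in the graph. The function name and the toy input are made up for illustration:

```shell
# keep_pseudomolecule_paths FILE: print FILE, dropping path records
# (P- or W-lines) whose name does not start with "Pseudomolecule_".
keep_pseudomolecule_paths() {
    awk -F'\t' -v prefix='Pseudomolecule_' '
        $1 == "P" { if (index($2, prefix) == 1) print; next }
        $1 == "W" { if (index($4, prefix) == 1) print; next }
        { print }
    ' "$1"
}

# Toy GFA (hypothetical content) to show the effect.
in_gfa=$(mktemp)
printf 'S\t1\tACGT\n' > "$in_gfa"
printf 'S\t2\tGG\n' >> "$in_gfa"
printf 'L\t1\t+\t2\t+\t0M\n' >> "$in_gfa"
printf 'P\tPseudomolecule_01\t1+,2+\t*\n' >> "$in_gfa"
printf 'P\tScaffolds_000\t2+\t*\n' >> "$in_gfa"
printf 'W\tsampleA\t1\tPseudomolecule_02\t0\t4\t>1\n' >> "$in_gfa"
printf 'W\tsampleA\t1\tScaffolds_001\t0\t2\t>2\n' >> "$in_gfa"

filtered=$(keep_pseudomolecule_paths "$in_gfa")
printf '%s\n' "$filtered"
```

On real data the call would look something like keep_pseudomolecule_paths pangenome_graph.vis.gfa > pseudomolecules_only.gfa (output name hypothetical), followed by the odgi build and odgi viz steps above.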

How do I go about doing this? I have also tried vg paths, but I was not able to filter the graph the way I want.
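A possibly simpler route, going by the odgi viz help quoted further down, is to leave the graph unchanged and restrict only which paths are drawn: -p, --paths-to-display takes a file with one path name per line. The sketch below generates that list. It assumes the path names in the .og file are exactly Pseudomolecule_01 to Pseudomolecule_19, which is worth checking with odgi paths -i FILE -L, since names coming out of a GBZ can carry sample or haplotype prefixes. The file names paths_to_show.txt, /data/pangenome_graph.vis.og and the output PNG are placeholders, and the odgi call is guarded so the snippet also runs where odgi or the graph is missing:

```shell
# Write Pseudomolecule_01 ... Pseudomolecule_19, one name per line.
paths_file=paths_to_show.txt
: > "$paths_file"
for i in $(seq 1 19); do
    printf 'Pseudomolecule_%02d\n' "$i" >> "$paths_file"
done

# Draw only the listed paths; the .og is assumed to be built from a GFA
# that still has the Pseudomolecules as path lines (first or second command of Step 1).
og=/data/pangenome_graph.vis.og
if command -v odgi >/dev/null 2>&1 && [ -f "$og" ]; then
    odgi viz -i "$og" -p "$paths_file" -o /data/pangenome_graph.pseudomolecules.png
fi
```

The same help text also lists -I, --ignore-prefix, which might do the job in a single flag if all scaffold paths share a name prefix.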

Note: I saw that I can extract a subgraph as shown here (https://github.com/vgteam/vg/wiki/Visualization), but I don't know how to use it, since I have a very large number of node IDs.

visualization odgi pangenome visualisation vg • 201 views

ADD COMMENT • link 4 days ago by s_135 ▴ 10

For reference: vg convert

usage: /usr/lib/vg convert [options] <input-graph>
input options:
    -g, --gfa-in           input in GFA format
    -r, --in-rgfa-rank N   import rgfa tags with rank <= N as paths [default=0]
    -b, --gbwt-in FILE     input graph is a GBWTGraph using the GBWT in FILE
        --ref-sample STR   change haplotypes for this sample to reference paths (may repeat)
gfa input options (use with -g):
    -T, --gfa-trans FILE   write gfa id conversions to FILE
output options:
    -v, --vg-out           output in VG's original Protobuf format [DEPRECATED: use -p instead].
    -a, --hash-out         output in HashGraph format
    -p, --packed-out       output in PackedGraph format [default]
    -x, --xg-out           output in XG format
    -f, --gfa-out          output in GFA format
    -H, --drop-haplotypes  do not include haplotype paths in the output
                           (useful with GBWTGraph / GBZ inputs)
gfa output options (use with -f):
    -P, --rgfa-path STR    write given path as rGFA tags instead of lines
                           (multiple allowed, only rank-0 supported)
    -Q, --rgfa-prefix STR  write paths with given prefix as rGFA tags instead of lines
                           (multiple allowed, only rank-0 supported)
    -B, --rgfa-pline       paths written as rGFA tags also written as lines
    -W, --no-wline         Write all paths as GFA P-lines instead of W-lines.
                           Allows handling multiple phase blocks and subranges used together.
    --gbwtgraph-algorithm  Always use the GBWTGraph library GFA algorithm.
                           Not compatible with other GFA output options or non-GBWT graphs.
    --vg-algorithm         Always use the VG GFA algorithm. Works with all options and graph types,
                           but can't preserve original GFA coordinates.
    --no-translation       When using the GBWTGraph algorithm, convert the graph directly to GFA.
                           Do not use the translation to preserve original coordinates.
alignment options:
    -G, --gam-to-gaf FILE  convert GAM FILE to GAF
    -F, --gaf-to-gam FILE  convert GAF FILE to GAM
general options:
    -t, --threads N        use N threads (defaults to numCPUs)

vg view

usage: /usr/bin/vg view [options] [ <graph.vg> | <graph.json> | <aln.gam> | <read1.fq> [<read2.fq>] ]
        options:
            -g, --gfa                  output GFA format (default)
            -F, --gfa-in               input GFA format, reducing overlaps if they occur
            -v, --vg                   output VG format [DEPRECATED, use vg convert instead]
            -V, --vg-in                input VG format only
            -j, --json                 output JSON format
            -J, --json-in              input JSON format
            -c, --json-stream          streaming conversion of a VG format graph in line delimited JSON format
                                       (this cannot be loaded directly via -J)
            -G, --gam                  output GAM format (vg alignment format: Graph Alignment/Map)
            -Z, --translation-in       input is a graph translation description
            -t, --turtle               output RDF/turtle format (can not be loaded by VG)
            -T, --turtle-in            input turtle format.
            -r, --rdf_base_uri         set base uri for the RDF output
            -a, --align-in             input GAM format
            -A, --aln-graph GAM        add alignments from GAM to the graph
            -q, --locus-in             input stream is Locus format
            -z, --locus-out            output stream Locus format
            -Q, --loci FILE            input is Locus format for use by dot output
            -d, --dot                  output dot format
            -S, --simple-dot           simplify the dot output; remove node labels, simplify alignments
            -u, --noseq-dot            shows size information instead of sequence in the dot output
            -e, --ascii-labels         use labels for paths or superbubbles with char/colors rather than emoji
            -Y, --ultra-label          label nodes with emoji/colors that correspond to ultrabubbles
            -m, --skip-missing         skip mappings to nodes not in the graph when drawing alignments
            -C, --color                color nodes that are not in the reference path (DOT OUTPUT ONLY)
            -p, --show-paths           show paths in dot output
            -w, --walk-paths           add labeled edges to represent paths in dot output
            -n, --annotate-paths       add labels to normal edges to represent paths in dot output
            -M, --show-mappings        with -p print the mappings in each path in JSON
            -I, --invert-ports         invert the edge ports in dot so that ne->nw is reversed
            -s, --random-seed N        use this seed when assigning path symbols in dot output
            -b, --bam                  input BAM or other htslib-parseable alignments
            -f, --fastq-in             input fastq (output defaults to GAM). Takes two
                                       positional file arguments if paired
            -X, --fastq-out            output fastq (input defaults to GAM)
            -i, --interleaved          fastq is interleaved paired-ended
            -L, --pileup               output VG Pileup format
            -l, --pileup-in            input VG Pileup format
            -B, --distance-in          input distance index
            -R, --snarl-in             input VG Snarl format
            -E, --snarl-traversal-in   input VG SnarlTraversal format
            -K, --multipath-in         input VG MultipathAlignment format (GAMP)
            -k, --multipath            output VG MultipathAlignment format (GAMP)
            -D, --expect-duplicates    don't warn if encountering the same node or edge multiple times
            -x, --extract-tag TAG      extract and concatenate messages with the given tag
            --verbose                  explain the file being read with --extract-tag
            --threads N                for parallel operations use this many threads [1]

vg paths

usage: /usr/bin/vg paths [options]
    options:
      input:
        -x, --xg FILE            use the paths and haplotypes in this graph FILE. Supports GBZ haplotypes.
                                 (Also accepts -v, --vg)
        -g, --gbwt FILE          use the threads in the GBWT index in FILE
                                 (graph also required for most output options; -g takes priority over -x)
      output graph (.vg format)
        -V, --extract-vg         output a path-only graph covering the selected paths
        -d, --drop-paths         output a graph with the selected paths removed
        -r, --retain-paths       output a graph with only the selected paths retained
        -n, --normalize-paths    output a graph where all equivalent paths in a site a merged (using selected paths to snap to if possible)
      output path data:
        -X, --extract-gam        print (as GAM alignments) the stored paths in the graph
        -A, --extract-gaf        print (as GAF alignments) the stored paths in the graph
        -L, --list               print (as a list of names, one per line) the path (or thread) names
        -E, --lengths            print a list of path names (as with -L) but paired with their lengths
        -M, --metadata           print a table of path names and their metadata
        -C, --cyclicity          print a list of path names (as with -L) but paired with flag denoting the cyclicity
        -F, --extract-fasta      print the paths in FASTA format
        -c, --coverage           print the coverage stats for selected paths (not including cylces)
      path selection:
        -p, --paths-file FILE    select the paths named in a file (one per line)
        -Q, --paths-by STR       select the paths with the given name prefix
        -S, --sample STR         select the haplotypes or reference paths for this sample
        -a, --variant-paths      select the variant paths added by 'vg construct -a'
        -G, --generic-paths      select the generic, non-reference, non-haplotype paths
        -R, --reference-paths    select the reference paths
        -H, --haplotype-paths    select the haplotype paths paths
        -t, --threads N          number of threads to use [all available]. applies only to snarl finding within -n

ADD REPLY • link 4 days ago by s_135 ▴ 10

odgi build

  odgi build {OPTIONS}

    Construct a dynamic succinct variation graph in ODGI format from a GFAv1.

  OPTIONS:

      [ MANDATORY OPTIONS ]
        -g[FILE], --gfa=[FILE]            GFAv1 FILE containing the nodes, edges
                                          and paths to build a dynamic succinct
                                          variation graph from.
        -o[FILE], --out=[FILE]            Write the dynamic succinct variation
                                          graph to this *FILE*. A file ending
                                          with *.og* is recommended.
      [ Graph Sorting ]
        -O, --optimize                    Compact the graph id space into a
                                          dense integer range.
        -s, --sort                        Apply a general topological sort to
                                          the graph and order the node ids
                                          accordingly. A bidirected adaptation
                                          of Kahn’s topological sort (1962) is
                                          used, which can handle components with
                                          no heads or tails. Here, both heads
                                          and tails are taken into account.
      [ Threading ]
        -t[N], --threads=[N]              Number of threads to use for parallel
                                          operations.
      [ Processing Information ]
        -P, --progress                    Write the current progress to stderr.
        -d, --debug                       Verbosely print graph information to
                                          stderr. This includes the maximum
                                          node_id, the minimum node_id, the
                                          handle to node_id mapping, the deleted
                                          nodes and the path metadata.
      [ Program Information ]
        -h, --help                        Print a help message for odgi build

ADD REPLY • link 4 days ago by s_135 ▴ 10

odgi viz

    Visualize a variation graph in 1D.
      [ MANDATORY OPTIONS ]
        -i[FILE], --idx=[FILE]            Load the succinct variation graph in
                                          ODGI format from this *FILE*. The file
                                          name usually ends with *.og*. It also
                                          accepts GFAv1, but the on-the-fly
                                          conversion to the ODGI format requires
                                          additional time!
        -o[FILE], --out=[FILE]            Write the visualization in PNG format
                                          to this *FILE*.
      [ Visualization Options ]
        -x[N], --width=[N]                Set the width in pixels of the output
                                          image (default: 1500).
        -y[N], --height=[N]               Set the height in pixels of the output
                                          image (default: 500).
        -a[N], --path-height=[N]          The height in pixels for a path.
        -X[N], --path-x-padding=[N]       The padding in pixels on the x-axis
                                          for a path.
        -n, --no-path-borders             Don't show path borders.
        -b, --black-path-borders          Draw path borders in black (default is
                                          white).
        -R, --pack-paths                  Pack all paths rather than displaying
                                          a single path per row.
        -L[FLOAT],
        --link-path-pieces=[FLOAT]        Show thin links of this relative width
                                          to connect path pieces.
        -A[STRING],
        --alignment-prefix=[STRING]       Apply alignment related visual motifs
                                          to paths which have this name prefix.
                                          It affects the [**-S, --show-strand**]
                                          and [**-d, –change-darkness**]
                                          options.
        -S, --show-strand                 Use red and blue coloring to display
                                          forward and reverse alignments. This
                                          parameter can be set in combination
                                          with [**-A,
                                          –alignment-prefix**=*STRING*].
        -z,
        --color-by-mean-inversion-rate    Change the color respect to the node
                                          strandness (black for forward, red for
                                          reverse); in binned mode (**-b,
                                          --binned-mode**), change the color
                                          respect to the mean inversion rate of
                                          the path for each bin, from black (no
                                          inversions) to red (bin mean inversion
                                          rate equals to 1).
        -N, --color-by-uncalled-bases     Change the color with respect to the
                                          uncalled bases of the path for each
                                          bin, from black (no uncalled bases) to
                                          green (all uncalled bases).
        -s[CHAR],
        --color-by-prefix=[CHAR]          Color paths by their names looking at
                                          the prefix before the given character
                                          CHAR.
        -M[FILE], --prefix-merges=[FILE]  Merge paths beginning with prefixes
                                          listed (one per line) in *FILE*.
        -I[PREFIX],
        --ignore-prefix=[PREFIX]          Ignore paths starting with the given
                                          *PREFIX*.
      [ Intervals Selection Options ]
        -r[STRING], --path-range=[STRING] Nucleotide range to visualize:
                                          ``STRING=[PATH:]start-end``.
                                          ``\*-end`` for ``[0,end]``;
                                          ``start-*`` for
                                          ``[start,pangenome_length]``. If no
                                          PATH is specified, the nucleotide
                                          positions refer to the pangenome’s
                                          sequence (i.e., the sequence obtained
                                          arranging all the graph’s node from
                                          left to right).
      [ Path Selection Options ]
        -p[FILE],
        --paths-to-display=[FILE]         List of paths to display in the
                                          specified order; the file must contain
                                          one path name per line and a subset of
                                          all paths can be specified.
      [ Path Names Viz Options ]
        -H, --hide-path-names             Hide the path names on the left of the
                                          generated image.
        -C, --color-path-names-background Color path names background with the
                                          same color as paths.
        -c[N],
        --max-num-of-characters=[N]       Maximum number of characters to
                                          display for each path name (max 128
                                          characters). The default value is *the
                                          length of the longest path name* (up
                                          to 128 characters).
      [ Binned Mode Options ]
        -w[bp], --bin-width=[bp]          The bin width specifies the size of
                                          each bin in the binned mode. If it is
                                          not specified, the bin width is
                                          calculated from the width in pixels of
                                          the output image.r
        -m, --color-by-mean-depth         Change the color with respect to the
                                          mean coverage of the path for each
                                          bin, using the colorbrewer palette
                                          specified in -B --colorbrewer-palette
        -B[SCHEME:N],
        --colorbrewer-palette=[SCHEME:N]  Use the colorbrewer palette specified
                                          by the given SCHEME, with the number
                                          of levels N. Specifiy 'show' to see
                                          available palettes.
        -G, --no-grey-depth               Use the colorbrewer palette for <0.5x
                                          and ~1x coverage bins. By default,
                                          these bins are light and neutral grey.
      [ Gradient Mode Options ]
        -d, --change-darkness             Change the color darkness based on
                                          nucleotide position in the path. When
                                          it is used in binned mode, the mean
                                          inversion rate of the bin node is
                                          considered to set the color gradient
                                          starting position: when this rate is
                                          greater than 0.5, the bin is
                                          considered inverted, and the color
                                          gradient starts from the right-end of
                                          the bin. This parameter can be set in
                                          combination with [**-A,
                                          –alignment-prefix**=*STRING*].
        -l, --longest-path                Use the longest path length to change
                                          the color darkness.
        -u, --white-to-black              Change the color darkness from white
                                          (for the first nucleotide position) to
                                          black (for the last nucleotide
                                          position).
      [ Compressed Mode Options ]
        -O, --compressed-mode             Compress the view vertically,
                                          summarizing the path coverage across
                                          all paths displaying the information
                                          using only one path 'COMPRESSED_MODE'.
                                          A heatmap color-coding from
                                          https://colorbrewer2.org/#type=diverging&scheme=RdBu&n=11
                                          is used. Alternatively, one can enter
                                          a colorbrewer palette via -B,
                                          --colorbrewer-palette.
      [ Threading ]
        -t[N], --threads=[N]              Number of threads to use for parallel
                                          operations.

ADD REPLY • link 4 days ago by s_135 ▴ 10