
Load balancing debugging info #728

Open
@angelhof


PaSh currently does not rebalance outputs between pipeline stages that have the same width. This can lead to pathological scenarios. For example, suppose the input of the program cat IN | cmd1 | cmd2 is a single line, cmd1 and cmd2 are both stateless, and cmd1 expands that line into many lines that cmd2 could process in parallel. PaSh gets no parallelism in this case, because all of cmd1's output stays on a single branch.

To help identify such pathological scenarios, it would be great to add a flag that inserts logging nodes into parts of the dataflow graph, where each node prints how many lines and bytes pass through it.

The steps to get this done would be to:

  1. Implement a command that simply forwards its input to its output (without buffering, unlike dgsh-tee), but also counts the bytes and lines it forwards and prints both totals at the end.
  2. Add this command after each stage of the dataflow graph to measure the load on each parallel branch of the graph.
  3. (optional) Create a simple post-processing tool that presents the output in a readable way (relative loads, or even a plot; see, for example, the --graphviz option in current PaSh).
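
A minimal Python sketch of steps 1 and 3, to make the idea concrete. This is not existing PaSh code: the names `forward_and_count`, `relative_loads`, and `main`, the node label argument, and the tab-separated report on stderr are all assumptions made for illustration.

```python
import sys


def forward_and_count(src, dst, chunk_size=65536):
    """Copy src to dst chunk by chunk and return (total_bytes, total_lines).

    src and dst are binary streams. read1() performs at most one underlying
    read, so data is forwarded as soon as it arrives instead of waiting for
    a full chunk or for end of input.
    """
    total_bytes = 0
    total_lines = 0
    while True:
        chunk = src.read1(chunk_size)
        if not chunk:  # empty bytes object means end of input
            break
        dst.write(chunk)
        dst.flush()  # do not hold data back from the next stage
        total_bytes += len(chunk)
        total_lines += chunk.count(b"\n")
    return total_bytes, total_lines


def relative_loads(counts):
    """Map each node label to its share of the total (hypothetical step 3 helper)."""
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in counts}
    return {label: value / total for label, value in counts.items()}


def main(argv):
    # Hypothetical CLI: the first argument labels this logging node.
    label = argv[1] if len(argv) > 1 else "node"
    n_bytes, n_lines = forward_and_count(sys.stdin.buffer, sys.stdout.buffer)
    # Report on stderr so the forwarded data on stdout stays untouched.
    print(f"{label}\t{n_lines}\t{n_bytes}", file=sys.stderr)
```

A script that calls `main(sys.argv)` could then be placed between two stages of a branch. Passing the line counts collected from all branches of one stage to `relative_loads` gives each branch's share of the work, so a branch with a share close to 1.0 would point to the imbalance described above.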


Labels: enhancement, help wanted
