Features Walkthrough

Albert Schimpf edited this page Feb 1, 2021 · 62 revisions

To understand all features, please first familiarize yourself with the Arrows and Nodes Model the framework is based on.

All examples can be run by saving the specification to a .yf file in a directory and executing Scraper in that directory (e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest).

Taskflow Specification

A taskflow can be written in either JSON (file ending .jf or .json) or YML (file ending .yf, .yaml, or .yml). These are the two core parsers shipped with Scraper, and both provide full functionality. For other parsers, or to implement your own format, see the section Scraper Formats.

An example specification in YML can be defined as follows:

name: helloworld
# entry: start
# globalNodeConfigurations: {}
# imports: {}
graphs:
  start:
    - type: Echo
      log: "Hello world!"
    - type: Echo
      log: "I accept the output of the previous node!"
    - type: Echo
      log: "I'm the last node in the graph!"
  other:
    - type: Echo
      log: "I'm not reachable and not used :("

Mandatory:

  • name: Identifier for the taskflow specification
  • graphs: A map whose keys are graph identifiers and whose values are lists of nodes.

Optional:

  • entry: The graph used as the entry point on startup. Default value is start.
  • globalNodeConfigurations: Used to configure multiple nodes at once. Local node configuration has precedence over global node configuration. Default is no global node configurations. See the Global Node Configuration section.
  • imports: Used to import other taskflows into this taskflow. See the Importing Other Taskflows section.
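
The mandatory and optional keys above can be pictured with a small Python sketch. It is only an illustration of the defaults described in this section, not the parser Scraper uses; the function name with_defaults is hypothetical, and the empty defaults for globalNodeConfigurations and imports are taken from the commented-out lines of the example.

```python
from typing import Any, Dict

MANDATORY = ("name", "graphs")


def with_defaults(spec: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a parsed specification with optional keys filled in.

    Hypothetical helper: it only mirrors the defaults documented above.
    """
    missing = [key for key in MANDATORY if key not in spec]
    if missing:
        raise ValueError(f"missing mandatory keys: {missing}")
    # Fresh default objects on every call, so callers never share state.
    full: Dict[str, Any] = {
        "entry": "start",
        "globalNodeConfigurations": {},
        "imports": {},
    }
    full.update(spec)  # explicitly specified keys win over the defaults
    return full


helloworld = with_defaults({
    "name": "helloworld",
    "graphs": {"start": [{"type": "Echo", "log": "Hello world!"}]},
})
print(helloworld["entry"])  # start
```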

Node Configuration

Each node in a graph is specified as a key-value map, and the available keys depend on the node's implementation. The configuration documentation for the core nodes and some additional development nodes can be found here. Documentation for your own nodes can be generated; see the Generating Node Documentation section of the developer guide. A template for quickly developing and packaging custom nodes can be found in the Developing Custom Nodes section.

Example: Echo creating a static output at key obj of type Map<String, String> (Echo Documentation):

    - type: Echo
      log: "Creating object"
      put: obj
      value:   
        id: "10"
        name: "smith"
        age: "20"

The types in the node configuration have to match the types defined in the node implementation, which can be checked with the generated documentation:

  • put: Where to put the object generated by value
  • value: The object to put at the put location of type A

Failing to match the types will throw an exception on starting the taskflow.

If a node is extending another node implementation, then it inherits the configuration of its parent. Generally, every node has a basic node configuration provided by Node.


Nodes and Forward Contract

Every node by default forwards to the next node using a dependent arrow. The next target can be configured via the configuration keys forward and goTo (Node Documentation) where next is defined as follows:

  1. If forward is false, then do not forward
  2. If goTo is not set, then forward to the following node in the graph (list) this node is in
    • If this node is the last node in the graph, then do not forward
  3. If goTo is set, then resolve the address and forward to that target address (see Address Schema)

If the node implementation does not itself create other dependent arrows (i.e. does not modify the control flow), then not forwarding means that the flow terminates at this node.

name: goto
graphs:
  start:
    - type: Echo
      goTo: other
    - type: Echo
      log: "I'm a labelled node!"
      label: anecho
    - type: Echo
      log: "I'm the last node in the graph!"
  other:
    - type: Echo
      log: "A short detour to another graph"
      goTo: start.anecho
    - type: Echo
      log: "I'm not reachable :("

It is usually sufficient, and recommended, to address whole graphs rather than individual nodes.
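
The three forwarding rules can be sketched in Python as follows. This is a simplified model, not Scraper's implementation: next_target is a hypothetical function, nodes are plain dicts, and address resolution for goTo is left out, so the address string is returned as-is.

```python
from typing import Any, Dict, List, Optional, Union

Node = Dict[str, Any]


def next_target(graph: List[Node], index: int) -> Optional[Union[int, str]]:
    """Decide where the node at graph[index] forwards to.

    Returns None (no forwarding), an int (index of the following node in
    the same graph) or a str (the unresolved goTo address).
    """
    node = graph[index]
    # Rule 1: forward defaults to true; false disables forwarding.
    if node.get("forward", True) is False:
        return None
    # Rule 3: an explicit goTo wins over the list order.
    go_to = node.get("goTo")
    if go_to is not None:
        return go_to
    # Rule 2: otherwise continue with the following node, if there is one.
    following = index + 1
    return following if following < len(graph) else None


# The start graph of the goto example above.
start = [
    {"type": "Echo", "goTo": "other"},
    {"type": "Echo", "log": "I'm a labelled node!", "label": "anecho"},
    {"type": "Echo", "log": "I'm the last node in the graph!"},
]
print(next_target(start, 0))  # other
print(next_target(start, 1))  # 2
print(next_target(start, 2))  # None
```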


Flow Map and Flows

A flow map is a key-value map and travels along arrows. For a dispatched arrow the flow map is copied for the newly created flow.

If multiple dependent arrows originate from the same node, then the node implementation decides which order the flow map is forwarded in. This is usually marked in the generated control flow graph.

Flow maps carry data between nodes. Via templates, they also make a node's configuration depend on the flow map that is currently passing through the node.

name: echotest
graphs:
  start:
    - type: Pipe
      pipeTargets: [ok, ok2]    
    # continue with the pipe result
    - type: Echo
      log: "My flow map contains both ok and ok2!"


  ok:
    - type: Echo
      put: ok # fills key ok in the accepting flow map
      value: hello

  ok2:
    - type: Echo
      put: ok2 # fills key ok2 in the accepting flow map
      value: hello

Nodes used: Echo and Pipe.

This example has no observable behavior. To inspect and use the contents of the flow map, templates can be used.


Templates and Dynamic/Dependent Node Configuration

To use flow map content and to make the configuration of nodes dependent on input, templates are used. A templating engine is embedded into Strings and follows a simple grammar.

  • {X}: Value of flow map at key X
  • {X}[Y]: List element at index Y of list X. It follows that X has to be a nested template which has to evaluate to a list and Y an Integer (or template that evaluates to an Integer)
  • {X@Y}: Map element at key Y of map X. It follows that X has to be a nested template which has to evaluate to a map and Y a String (or template that evaluates to a String)
  • {X}{Y}: String concatenation
  • helloworld: Static template

name: echoinspect
graphs:
  start:
    - type: Pipe
      pipeTargets: [ok, ok2]    
    - type: Echo
      log: # print out a map
        info: "My input contains a list at oklist and a map at okmap"
        # {oklist} resolves to the list, {_}[0] inspects the content
        msg1: "{{oklist}}[0] {{oklist}}[1]" 
        msg2: "{okmap}"
        msg3: "Is {{okmap}@age} years old"

  # use JSON embedding to save a bit of space
  ok: [ {type: Echo, put: oklist, value: [hello, world] } ]
  # age: 25 does not work, as it is an integer and not a String
  # using 25 causes the typechecker to throw an error, prohibiting the execution of the workflow
  ok2: [ {type: Echo, put: okmap, value: {name: smith, age: "25"} } ]
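
The template forms listed above can be modeled with a small recursive evaluator in Python. This is a rough sketch for building intuition, not the templating engine embedded in Scraper. The function names are hypothetical, escaping is not handled, and the sketch assumes that {X}[Y] is treated as a list access only when X evaluates to a list.

```python
from typing import Any, Dict, Tuple


def evaluate(template: str, flow: Dict[str, Any]) -> Any:
    """Evaluate a template string against a flow map (simplified model)."""
    value, _ = _sequence(template, 0, flow, stops="")
    return value


def _sequence(s: str, i: int, flow: Dict[str, Any], stops: str) -> Tuple[Any, int]:
    """Parse literals and {...} parts until the end or a stop character."""
    parts = []
    while i < len(s) and s[i] not in stops:
        if s[i] == "{":
            value, i = _braces(s, i, flow)
        else:
            start = i
            while i < len(s) and s[i] != "{" and s[i] not in stops:
                i += 1
            value = s[start:i]  # static template
        parts.append(value)
    if len(parts) == 1:
        return parts[0], i  # keep lists and maps intact
    return "".join(str(p) for p in parts), i  # {X}{Y}: concatenation


def _expect(s: str, i: int, ch: str) -> int:
    if i >= len(s) or s[i] != ch:
        raise ValueError(f"expected {ch!r} at position {i} in {s!r}")
    return i + 1


def _braces(s: str, i: int, flow: Dict[str, Any]) -> Tuple[Any, int]:
    inner, i = _sequence(s, i + 1, flow, stops="@}")
    if i < len(s) and s[i] == "@":  # {X@Y}: map element
        key, i = _sequence(s, i + 1, flow, stops="}")
        i = _expect(s, i, "}")
        return inner[str(key)], i
    i = _expect(s, i, "}")
    if isinstance(inner, list) and i < len(s) and s[i] == "[":  # {X}[Y]
        index, i = _sequence(s, i + 1, flow, stops="]")
        i = _expect(s, i, "]")
        return inner[int(index)], i
    return flow[inner], i  # {X}: flow map lookup


flow = {"oklist": ["hello", "world"], "okmap": {"name": "smith", "age": "25"}}
print(evaluate("{{oklist}}[0] {{oklist}}[1]", flow))  # hello world
print(evaluate("Is {{okmap}@age} years old", flow))   # Is 25 years old
```

The flow map in the usage lines mirrors what the ok and ok2 graphs of the example put at oklist and okmap.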

The log configuration of Node has the type T<?> (which is the same as T<Object> in Scraper). This means String templates can also appear inside lists and maps, and every nested template is evaluated. In contrast, value of Echo has the type T<A>, so the type checker only allows homogeneous lists and maps. If you need to build complex JSON objects, JsonObject can be used. It has the same API, but its type is T<Object>, so the resulting output carries less type information and is more dangerous to use.
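
To illustrate what homogeneous means here, the following Python sketch inspects only literal configuration values. It is a loose approximation under that assumption and does not reproduce Scraper's type checker; literal_shape is a hypothetical helper.

```python
from typing import Any, List


def literal_shape(value: Any) -> str:
    """Describe the shape of a literal value, e.g. 'map<str>'.

    Raises TypeError when the elements of a list or map differ in shape.
    """
    if isinstance(value, dict):
        return f"map<{_element_shape(list(value.values()))}>"
    if isinstance(value, list):
        return f"list<{_element_shape(value)}>"
    return type(value).__name__


def _element_shape(elements: List[Any]) -> str:
    shapes = {literal_shape(element) for element in elements}
    if len(shapes) > 1:
        raise TypeError(f"heterogeneous elements: {sorted(shapes)}")
    return shapes.pop() if shapes else "?"


print(literal_shape({"name": "smith", "age": "25"}))  # map<str>
print(literal_shape(["hello", "world"]))              # list<str>
# literal_shape({"name": "smith", "age": 25}) raises TypeError
```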

Assume that earlier in the taskflow you gathered an id, a list of String comments at comments, and a String title at title. To save a JSON document to disk, you first have to build it; JsonObject can be used for this purpose:

name: staticoutput
graphs:
  start:
    # replace these Echos with another taskflow that gathers information
    - {type: Echo, put: id, value: 1}
    - {type: Echo, put: title, value: "hello world"}
    - {type: Echo, put: comments, value: ["lorem", "ipsum"]}

    - type: JsonObject # Using Echo results in a type error!
      put: jsondoc
      value:
        id: "{id}"
        title: "{title}"
        comments: "{comments}" # a list!
        new: true # some static information
    - type: ObjectToJsonString
      object: "{jsondoc}"
      #result: "result"
    - type: WriteLineToFile
      output: "out.json"
      line: "{result}"
      overwrite: true

Nodes used: Echo, JsonObject, ObjectToJsonString, WriteLineToFile


Node Addressing

The absolute address schema is taskflow.graph.label or taskflow.graph.index.

Relative address schemas are allowed.

addr as seen from a node can be (checked in this order):

  1. local node with label addr
  2. graph id addr
  3. imported taskflow addr

addr1.addr2 as seen from a node can be (checked in this order):

  1. node addr2 in graph addr1
  2. graph addr2 in imported taskflow addr1

name: myflow
graphs:
  start:
    - type: Pipe
      pipeTargets: ["localnode", "localgraph", "graph.nodeingraph", "myflow.graph.0"]
    - { type: Echo, log: "Finished!", forward: false }
    - { type: Echo, log: "I'm addressed directly", label: localnode }

  localgraph:
    - { type: Echo, log: "localgraph!" }

  graph:
    - { type: Echo, log: "first node in graph!" }
    - { type: Echo, log: "I'm accessed twice!", label: nodeingraph }
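
The lookup order for relative and absolute addresses can be sketched in Python as below. This is a simplified model built on several assumptions that the text above does not confirm: "local node" is taken to mean a labelled node in the same graph, a bare graph or taskflow address resolves to the first node of that graph or of the taskflow's entry graph, and imports are given as a list of taskflow names. All function names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Tuple

Address = Tuple[str, str, int]  # (taskflow, graph, node index)
Node = Dict[str, Any]


def _by_label(nodes: List[Node], label: str) -> Optional[int]:
    for i, node in enumerate(nodes):
        if node.get("label") == label:
            return i
    return None


def _by_ref(nodes: List[Node], ref: str) -> Optional[int]:
    """Look up a node by index (all digits) or by label."""
    if ref.isdigit():
        return int(ref) if int(ref) < len(nodes) else None
    return _by_label(nodes, ref)


def resolve(addr: str, flows: Dict[str, Any], flow: str, graph: str) -> Address:
    """Resolve addr as seen from a node in flows[flow]["graphs"][graph]."""
    parts = addr.split(".")
    graphs = flows[flow]["graphs"]
    imports = flows[flow].get("imports", [])
    if len(parts) == 1:
        (a,) = parts
        idx = _by_label(graphs[graph], a)  # 1. local node with label
        if idx is not None:
            return (flow, graph, idx)
        if a in graphs:                    # 2. graph id
            return (flow, a, 0)
        if a in imports:                   # 3. imported taskflow
            return (a, flows[a].get("entry", "start"), 0)
    elif len(parts) == 2:
        a1, a2 = parts
        if a1 in graphs:                   # 1. node a2 in graph a1
            idx = _by_ref(graphs[a1], a2)
            if idx is not None:
                return (flow, a1, idx)
        if a1 in imports and a2 in flows[a1]["graphs"]:  # 2. imported graph
            return (a1, a2, 0)
    elif len(parts) == 3:                  # absolute address
        f, g, ref = parts
        if (f == flow or f in imports) and g in flows[f]["graphs"]:
            idx = _by_ref(flows[f]["graphs"][g], ref)
            if idx is not None:
                return (f, g, idx)
    raise KeyError(f"cannot resolve address {addr!r}")


flows = {"myflow": {"graphs": {
    "start": [{"type": "Pipe"}, {"type": "Echo", "forward": False},
              {"type": "Echo", "label": "localnode"}],
    "localgraph": [{"type": "Echo"}],
    "graph": [{"type": "Echo"}, {"type": "Echo", "label": "nodeingraph"}],
}}}
for target in ["localnode", "localgraph", "graph.nodeingraph", "myflow.graph.0"]:
    print(target, "->", resolve(target, flows, "myflow", "start"))
```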

Addressing nodes inside graphs could make the resulting control flow less understandable:

[Figure: generated control flow graph]


Importing Other Taskflows

Importing is used to modularize a taskflow. Addressing only works from parent to child: nodes in a child taskflow cannot address nodes in a parent taskflow.

The design of importing taskflows is under discussion and feedback is welcome.

Currently, imports is a map where the key specifies the path of the taskflow to import and the value is unused.

myflow.yf:

name: myflow
imports:
  import.yf:
graphs:
  start:
    - { type: Echo, log: hello, goTo: importedflow.gohere }

import.yf:

name: importedflow
graphs:
  gohere:
    - { type: Echo, log: "I'm here now" }

When using more than one specification, the main specification has to be supplied as a command line argument, e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest myflow.yf.

[Figure: generated control flow graph]


Flow Graph Generation

The core framework is able to generate flow graphs to visualize the specification via the nodes and arrows model.

root.yf:

name: root
imports:
  child.yf:
graphs:
  start:
    - { type: Echo, log: "Starting taskflow"}
    - { type: Pipe, pipeTargets: [arg1, root.arg2.0] }
    - { type: Echo, log: "Finished: {arg1} {arg2}!"}

  arg1:
    - { type: Echo, put: elements1, value: ["1","2","4","5","6"] }
    - { type: Map, list: "{elements1}", mapTarget: child.api, putElement: a }
    - { type: Echo, put: arg1, value: hello}

  arg2:
    - { type: Echo }
    - { type: Echo, put: elements2, value: ["wo", "rl", "d"] }
    - { type: Map, list: "{elements2}", mapTarget: child.api, putElement: a }
    - { type: Echo, put: arg2, value: "{{elements2}}[0]" }

child.yf:

name: child
graphs:
  # API
  # a :: String
  # static checks ensure that this specification is only executed if the importers provide a String at location `a`
  api:
    - { type: Echo, log: "Got input {a}"}

Running Scraper with the cfg argument (e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest root.yf cfg exit) yields:

[Figure: generated control flow graph]

Currently, crossed arrows are depicted as red arrows in the flow graph generator.


Global Node Configuration

Global node configurations can be specified by node type or by a regex that is matched against node types. A regex has to be enclosed in slashes, e.g. "/Pip.*/".

name: myflow
globalNodeConfigurations:
  Echo:
    log: "Global logging"
  "/Pip.*/":
    log: "My target is tar"
    pipeTargets: [tar]
graphs:
  start:
    - { type: Echo, log: "hello"}
    - { type: Pipe }
  tar:
    - { type: Echo }
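
How global and local configuration combine for the example above can be sketched in Python. This is an illustration under assumptions rather than Scraper's actual merge logic: the regex is assumed to have to match the whole node type, multiple matching global entries are applied in declaration order, and effective_config is a hypothetical name. The one rule taken from the text is that local node configuration has precedence over global node configuration.

```python
import re
from typing import Any, Dict

Config = Dict[str, Any]


def effective_config(node: Config, global_configs: Dict[str, Config]) -> Config:
    """Merge matching global configurations into a node's local configuration."""
    merged: Config = {}
    node_type = node["type"]
    for key, config in global_configs.items():
        is_regex = len(key) >= 2 and key.startswith("/") and key.endswith("/")
        if is_regex:
            # Assumption: the pattern has to match the complete type name.
            matches = re.fullmatch(key[1:-1], node_type) is not None
        else:
            matches = key == node_type
        if matches:
            merged.update(config)
    merged.update(node)  # local configuration has precedence
    return merged


global_configs = {
    "Echo": {"log": "Global logging"},
    "/Pip.*/": {"log": "My target is tar", "pipeTargets": ["tar"]},
}
print(effective_config({"type": "Echo", "log": "hello"}, global_configs)["log"])  # hello
print(effective_config({"type": "Echo"}, global_configs)["log"])  # Global logging
print(effective_config({"type": "Pipe"}, global_configs)["pipeTargets"])  # ['tar']
```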