Features Walkthrough
To understand all features, please first familiarize yourself with the Arrows and Nodes Model the framework is based on.
All examples can be run by saving the specification to a .yf file in a directory and executing Scraper in that directory (e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest).
A taskflow can be written in both JSON (file ending .jf or .json) and YML (file ending .yf, .yaml, or .yml).
These are the two core parsers provided with Scraper, with full functionality included.
For more types of parsers or how to implement your own format, see the section Scraper Formats.
An example specification in YML can be defined as follows:
name: helloworld
# entry: start
# globalNodeConfigurations: {}
# imports: {}
graphs:
  start:
    - type: Echo
      log: "Hello world!"
    - type: Echo
      log: "I accept the output of the previous node!"
    - type: Echo
      log: "I'm the last node in the graph!"
  other:
    - type: Echo
      log: "I'm not reachable and not used :("
Mandatory:
- name: Identifier for the taskflow specification.
- graphs: A map where each key is the identifier of a single graph. Values are lists of nodes.
Optional:
- entry: Which graph is going to be used for the first entry task on startup. Default value is start.
- globalNodeConfigurations: Used to configure multiple nodes at once. Local node configuration has precedence over global node configuration. Default is no global node configurations. See the Global Node Configuration section.
- imports: Used to import other taskflows into this taskflow. See the Importing Other Taskflows section.
Each node in a graph can be configured depending on its implementation. A node is a key-value map in the specification. The documentation for each configuration of the core and some additional development Scraper nodes can be found here. Documentation for your own nodes can be generated; for more information, see the Generating Node Documentation section of the developer guide. A template for developing and packaging your own custom nodes quickly can be found in the Developing Custom Nodes section.
Example: Echo creating a static output at key obj of type Map<String, String> (Echo Documentation):
- type: Echo
  log: "Creating object"
  put: obj
  value:
    id: "10"
    name: "smith"
    age: "20"
The types of the node configuration have to match the types defined in the node implementation, which can be checked with the generated documentation:
- put: Where to put the object generated by value
- value: The object to put at the put location, of type A
Failing to match the types will throw an exception on starting the taskflow.
If a node is extending another node implementation, then it inherits the configuration of its parent. Generally, every node has a basic node configuration provided by Node.
Every node by default forwards to the next node using a dependent arrow.
The next target can be configured via the configuration keys forward and goTo (Node Documentation) where next is defined as follows:
- If forward is false, then do not forward
- If goTo is not set, then forward to the following node in the graph (list) this node is in
  - If this node is the last node in the graph, then do not forward
- If goTo is set, then resolve the address and forward to that target address (see Address Schema)
If the node implementation does not create other dependent arrows (i.e. modifying control flow), then not forwarding is the same as the flow terminating.
name: goto
graphs:
  start:
    - type: Echo
      goTo: other
    - type: Echo
      log: "I'm a labelled node!"
      label: anecho
    - type: Echo
      log: "I'm the last node in the graph!"
  other:
    - type: Echo
      log: "A short detour to another graph"
      goTo: start.anecho
    - type: Echo
      log: "I'm not reachable :("
It is usually enough and recommended to only address up to graph labels and not address single nodes directly.
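The forwarding rules for forward and goTo can be sketched in a few lines of Python. This is only an illustration of the rules as described in this walkthrough, not Scraper's actual implementation: the function name next_target, the tuple return values, and the assumption that forward defaults to true are all hypothetical.

```python
from typing import Optional, Tuple


def next_target(graph: list, index: int) -> Optional[Tuple[str, object]]:
    """Decide where the flow goes after graph[index] (hypothetical helper).

    Returns ("goTo", address), ("index", position) or None if nothing follows.
    """
    node = graph[index]
    # forward: false disables forwarding (assumed default: true).
    if node.get("forward", True) is False:
        return None
    # An explicit goTo replaces the positional successor;
    # the address would still have to be resolved afterwards.
    if "goTo" in node:
        return ("goTo", node["goTo"])
    # Otherwise forward to the following node in the list, if there is one.
    if index + 1 < len(graph):
        return ("index", index + 1)
    return None


# The start graph of the goto example, reduced to the relevant keys.
start = [
    {"type": "Echo", "goTo": "other"},
    {"type": "Echo", "label": "anecho"},
    {"type": "Echo"},
]
print(next_target(start, 0))
print(next_target(start, 1))
print(next_target(start, 2))
```

The first node jumps to the graph other, the labelled node falls through to its successor, and the last node has no next target.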
A flow map is a key-value map and travels along arrows.
For a dispatched arrow the flow map is copied for the newly created flow.
If multiple dependent arrows originate from the same node, then the node implementation decides the order in which the flow map is forwarded.
This is usually marked in the generated control flow graph.
Flow maps are used to carry data around and, via templates, to configure nodes dynamically depending on the flow map that is currently passing through.
name: echotest
graphs:
  start:
    - type: Pipe
      pipeTargets: [ok, ok2]
    # continue with the pipe result
    - type: Echo
      log: "My flow map contains both ok and ok2!"
  ok:
    - type: Echo
      put: ok # fills key ok in the accepting flow map
      value: hello
  ok2:
    - type: Echo
      put: ok2 # fills key ok2 in the accepting flow map
      value: hello
This example has no observable behavior. To inspect and use the contents of the flow map, templates can be used.
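The difference between threading one flow map through several targets and copying it for a dispatched flow can be illustrated with a small Python model. All names here (echo_put, run_graph, pipe, dispatch) are hypothetical, and the model assumes that Pipe passes the result of one target on to the next and that a dispatched flow receives a deep copy; the real node implementations may differ.

```python
import copy


def echo_put(key, value):
    """Model of an Echo node with put/value: writes one key into the flow map."""
    def step(flow_map):
        flow_map[key] = value
        return flow_map
    return step


def run_graph(graph, flow_map):
    """Run the steps of a graph in order on the same flow map."""
    for step in graph:
        flow_map = step(flow_map)
    return flow_map


def pipe(targets, flow_map):
    """Assumed Pipe behaviour: thread one flow map through every target in order."""
    for target in targets:
        flow_map = run_graph(target, flow_map)
    return flow_map


def dispatch(target, flow_map):
    """Dispatched arrow: the new flow works on its own copy of the flow map."""
    return run_graph(target, copy.deepcopy(flow_map))


ok = [echo_put("ok", "hello")]
ok2 = [echo_put("ok2", "hello")]

piped = pipe([ok, ok2], {})
print(piped)  # contains both ok and ok2

parent = {"a": 1}
child = dispatch(ok, parent)
print(parent, child)  # parent is unchanged, child has the extra key
```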
To use flow map content and to make the configuration of nodes dependent on input, templates are used. A templating engine is embedded into Strings and follows a simple grammar.
- {X}: Value of flow map at key X
- {X}[Y]: List element at index Y of list X. It follows that X has to be a nested template which has to evaluate to a list and Y an Integer (or template that evaluates to an Integer)
- {X@Y}: Map element at key Y of map X. It follows that X has to be a nested template which has to evaluate to a map and Y a String (or template that evaluates to a String)
- {X}{Y}: String concatenation
- helloworld: Static template
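To make the evaluation rules of the grammar concrete, the following Python sketch models each template form as a small tree node and evaluates it against a flow map. It deliberately skips parsing the string syntax; the class names and the evaluate function are hypothetical and only mirror the rules listed above, not Scraper's template engine.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Static:        # helloworld
    text: str


@dataclass
class Lookup:        # {X}
    key: Any         # template that evaluates to the flow map key


@dataclass
class ListIndex:     # {X}[Y]
    target: Any      # template that evaluates to a list
    index: Any       # template that evaluates to an integer


@dataclass
class MapKey:        # {X@Y}
    target: Any      # template that evaluates to a map
    key: Any         # template that evaluates to a string


@dataclass
class Concat:        # {X}{Y}
    left: Any
    right: Any


def evaluate(template, flow_map):
    """Evaluate a template tree against a flow map; nested templates go first."""
    if isinstance(template, Static):
        return template.text
    if isinstance(template, Lookup):
        return flow_map[evaluate(template.key, flow_map)]
    if isinstance(template, ListIndex):
        items = evaluate(template.target, flow_map)
        return items[int(evaluate(template.index, flow_map))]
    if isinstance(template, MapKey):
        mapping = evaluate(template.target, flow_map)
        return mapping[evaluate(template.key, flow_map)]
    if isinstance(template, Concat):
        return str(evaluate(template.left, flow_map)) + str(evaluate(template.right, flow_map))
    raise TypeError(f"unknown template node: {template!r}")


flow_map = {"oklist": ["hello", "world"], "okmap": {"name": "smith", "age": "25"}}

# {{oklist}}[0]
first = ListIndex(Lookup(Static("oklist")), Static("0"))
# Is {{okmap}@age} years old
age_msg = Concat(
    Concat(Static("Is "), MapKey(Lookup(Static("okmap")), Static("age"))),
    Static(" years old"),
)
print(evaluate(first, flow_map))
print(evaluate(age_msg, flow_map))
```

Modelling the forms as a tree keeps the sketch independent of how the string syntax is actually tokenized, which is not specified here.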
name: echoinspect
graphs:
  start:
    - type: Pipe
      pipeTargets: [ok, ok2]
    - type: Echo
      log: # print out a map
        info: "My input contains a list at oklist and a map at okmap"
        # {oklist} resolves to the list, {_}[0] inspects the content
        msg1: "{{oklist}}[0] {{oklist}}[1]"
        msg2: "{okmap}"
        msg3: "Is {{okmap}@age} years old"
  # use JSON embedding to save a bit of space
  ok: [ {type: Echo, put: oklist, value: [hello, world] } ]
  # age: 25 does not work, as it is an integer and not a String
  # using 25 causes the typechecker to throw an error, prohibiting the execution of the workflow
  ok2: [ {type: Echo, put: okmap, value: {name: smith, age: "25"} } ]
For the log configuration of Node we use the fact that it has the type T<?> (which is the same as T<Object> in Scraper).
This means String templates can be inside lists and maps, too.
Every nested template is evaluated.
In contrast, put of Echo has the type T<A>, so only homogeneous lists and maps are allowed by the type checker.
If you need to build complex JSON objects, JsonObject can be used.
It has the same API but the type is T<Object>, therefore the resulting output has less type information and is more dangerous to use.
Assume that earlier in the taskflow you gathered an id, a list of String comments at comments, and a String title at title. To save a JSON document to disk, you first have to build it; JsonObject can be used for this purpose:
name: staticoutput
graphs:
  start:
    # replace these Echos with another taskflow that gathers information
    - {type: Echo, put: id, value: 1}
    - {type: Echo, put: title, value: "hello world"}
    - {type: Echo, put: comments, value: ["lorem", "ipsum"]}
    - type: JsonObject # Using Echo results in a type error!
      put: jsondoc
      value:
        id: "{id}"
        title: "{title}"
        comments: "{comments}" # a list!
        new: true # some static information
    - type: ObjectToJsonString
      object: "{jsondoc}"
      #result: "result"
    - type: WriteLineToFile
      output: "out.json"
      line: "{result}"
      overwrite: true
Nodes used: Echo, JsonObject, ObjectToJsonString, WriteLineToFile
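For readers more familiar with general-purpose code, the three steps of this taskflow (build a heterogeneous object, serialize it, write one line to a file) roughly correspond to the following Python sketch. It is an analogy only: the helper names build_document and write_line are hypothetical, and the exact JSON formatting produced by ObjectToJsonString is not specified here.

```python
import json


def build_document(flow_map):
    """Counterpart of the JsonObject step: mix values of different types."""
    return {
        "id": flow_map["id"],
        "title": flow_map["title"],
        "comments": flow_map["comments"],  # a list
        "new": True,                       # some static information
    }


def write_line(path, line, overwrite):
    """Counterpart of WriteLineToFile: write one line, replacing or appending."""
    mode = "w" if overwrite else "a"
    with open(path, mode, encoding="utf-8") as handle:
        handle.write(line + "\n")


flow_map = {"id": 1, "title": "hello world", "comments": ["lorem", "ipsum"]}
document = build_document(flow_map)
line = json.dumps(document)  # counterpart of ObjectToJsonString
print(line)
# write_line("out.json", line, overwrite=True)  # uncomment to write the file
```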
The absolute address schema is taskflow.graph.label or taskflow.graph.index.
Relative address schemas are allowed.
addr as seen from a node can be (checked in this order):
- local node with label addr
- graph id addr
- imported taskflow addr
addr1.addr2 as seen from a node can be (checked in this order):
- node addr2 in graph addr1
- graph addr2 in imported taskflow addr1
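The lookup order for relative and absolute addresses can be sketched as a small Python resolver. This is an illustration of the order described above, not Scraper's resolver: the dictionary layout, the function names, the tuple results, and the choice to key imports by the imported taskflow's name are assumptions made for this sketch.

```python
def find_node(graph, ref, allow_index=True):
    """Return the position of the node with label ref (or index ref), else None."""
    for position, node in enumerate(graph):
        if node.get("label") == ref:
            return position
    if allow_index and ref.isdigit() and int(ref) < len(graph):
        return int(ref)
    return None


def resolve(addr, flow, current_graph):
    """Resolve addr as seen from a node in flow["graphs"][current_graph]."""
    parts = addr.split(".")
    graphs = flow["graphs"]
    imports = flow.get("imports", {})  # assumed: imported name -> taskflow dict
    if len(parts) == 1:
        (name,) = parts
        position = find_node(graphs[current_graph], name, allow_index=False)
        if position is not None:                 # 1. local node with label addr
            return ("node", flow["name"], current_graph, position)
        if name in graphs:                       # 2. graph id addr
            return ("graph", flow["name"], name)
        if name in imports:                      # 3. imported taskflow addr
            return ("taskflow", imports[name]["name"])
    elif len(parts) == 2:
        first, second = parts
        if first in graphs:                      # 1. node addr2 in graph addr1
            position = find_node(graphs[first], second)
            if position is not None:
                return ("node", flow["name"], first, position)
        if first in imports and second in imports[first]["graphs"]:
            return ("graph", imports[first]["name"], second)  # 2. graph in import
    elif len(parts) == 3:                        # absolute: taskflow.graph.label|index
        taskflow, graph_id, node_ref = parts
        target = flow if taskflow == flow["name"] else imports.get(taskflow)
        if target is not None and graph_id in target["graphs"]:
            position = find_node(target["graphs"][graph_id], node_ref)
            if position is not None:
                return ("node", target["name"], graph_id, position)
    return None


myflow = {
    "name": "myflow",
    "graphs": {
        "start": [{"type": "Pipe"}, {"type": "Echo"}, {"type": "Echo", "label": "localnode"}],
        "localgraph": [{"type": "Echo"}],
        "graph": [{"type": "Echo"}, {"type": "Echo", "label": "nodeingraph"}],
    },
}
for address in ["localnode", "localgraph", "graph.nodeingraph", "myflow.graph.0"]:
    print(address, "->", resolve(address, myflow, "start"))
```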
name: myflow
graphs:
  start:
    - type: Pipe
      pipeTargets: ["localnode", "localgraph", "graph.nodeingraph", "myflow.graph.0"]
    - { type: Echo, log: "Finished!", forward: false }
    - { type: Echo, log: "I'm addressed directly", label: localnode }
  localgraph:
    - { type: Echo, log: "localgraph!" }
  graph:
    - { type: Echo, log: "first node in graph!" }
    - { type: Echo, log: "I'm accessed twice!", label: nodeingraph }
Addressing nodes inside graphs could make the resulting control flow less understandable.
Importing is used to modularize the taskflow. The addressing can only happen from parent to child: nodes in a child taskflow cannot address nodes in a parent taskflow.
The design of importing taskflows is under discussion and feedback is welcome.
Currently, imports is a map where the key specifies the path of the taskflow to import and the value is unused.
myflow.yf:
name: myflow
imports:
  import.yf:
graphs:
  start:
    - { type: Echo, log: hello, goTo: importedflow.gohere }
import.yf:
name: importedflow
graphs:
  gohere:
    - { type: Echo, log: "I'm here now" }
When using more than one specification, the main specification has to be supplied as a command line argument, e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest myflow.yf.
The core framework is able to generate flow graphs to visualize the specification via the nodes and arrows model.
root.yf:
name: root
imports:
  child.yf:
graphs:
  start:
    - { type: Echo, log: "Starting taskflow"}
    - { type: Pipe, pipeTargets: [arg1, root.arg2.0] }
    - { type: Echo, log: "Finished: {arg1} {arg2}!"}
  arg1:
    - { type: Echo, put: elements1, value: ["1","2","4","5","6"] }
    - { type: Map, list: "{elements1}", mapTarget: child.api, putElement: a }
    - { type: Echo, put: arg1, value: hello}
  arg2:
    - { type: Echo }
    - { type: Echo, put: elements2, value: ["wo", "rl", "d"] }
    - { type: Map, list: "{elements2}", mapTarget: child.api, putElement: a }
    - { type: Echo, put: arg2, value: "{{elements2}}[0]" }
child.yf:
name: child
graphs:
  # API
  # a :: String
  # static checks ensure that this specification is only executed if the importers provide a String at location `a`
  api:
    - { type: Echo, log: "Got input {a}"}
Executing cfg (e.g. docker run -v "$PWD":/rt/ --rm albsch/scraper:latest root.yf cfg exit) generates the flow graph for this specification.
Currently, crossed arrows are depicted as red arrows in the flow graph generator.
Global node configurations can be specified by node type or by a regex match on node types.
A regex has to be enclosed in slashes (//), e.g. /Pip.*/.
name: myflow
globalNodeConfigurations:
  Echo:
    log: "Global logging"
  "/Pip.*/":
    log: "My target is tar"
    pipeTargets: [tar]
graphs:
  start:
    - { type: Echo, log: "hello"}
    - { type: Pipe }
  tar:
    - { type: Echo }
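How global and local configuration combine for the nodes in this example can be sketched in Python. The helper effective_config is hypothetical; it assumes that a regex must match the whole node type and that, when several global entries match, later entries override earlier ones. Neither detail is specified in this walkthrough. The one rule taken from the text is that local configuration takes precedence over global configuration.

```python
import re


def effective_config(node, global_configs):
    """Merge matching global configurations into a node, local keys winning."""
    merged = {}
    for pattern, config in global_configs.items():
        if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
            # Regex entry: strip the surrounding slashes (assumed: full match).
            matches = re.fullmatch(pattern[1:-1], node["type"]) is not None
        else:
            # Plain entry: exact node type.
            matches = pattern == node["type"]
        if matches:
            merged.update(config)
    merged.update(node)  # local node configuration has precedence
    return merged


global_configs = {
    "Echo": {"log": "Global logging"},
    "/Pip.*/": {"log": "My target is tar", "pipeTargets": ["tar"]},
}
print(effective_config({"type": "Echo", "log": "hello"}, global_configs))
print(effective_config({"type": "Pipe"}, global_configs))
print(effective_config({"type": "Echo"}, global_configs))
```

Under these assumptions the first Echo keeps its local log, the Pipe node receives both log and pipeTargets from the regex entry, and the Echo in tar falls back to the global log message.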