
Implementing Spark on Top of TaskGraph

Spark is essentially an RDD combined with shared variables (broadcast variables and accumulators); see http://spark.apache.org/docs/latest/programming-guide.html for details. Note that shared variables are the main reason Spark is logically different from MapReduce, aside from the fact that it hosts data in memory.
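
Concretely, the two kinds of shared variables can be sketched as two small Go interfaces. This is only an illustration of the roles described above; the method names (Weights, Add, Merge, Sum) are assumptions made for this page and are not part of the taskgraph or Spark API.

    // BroadcastVariable is read-only state (for example the current model
    // parameters) that is pushed down from the parameter server; slaves can
    // read it but never modify it.
    type BroadcastVariable interface {
        Weights() []float64
    }

    // Accumulator is merge-only state that flows up the tree: each slave adds
    // its local results, folds in its children's accumulators, and forwards
    // the combined value to its parent.
    type Accumulator interface {
        Add(delta []float64)     // add one local contribution
        Merge(child Accumulator) // fold in a child's accumulator
        Sum() []float64          // read the aggregated value (parameter server side)
    }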

To implement Spark, we simply need to build a thin layer on top of taskgraph that takes care of the following things:

  1. Operate on a tree structure: one parameter server sits at the top of the tree, and the rest are slaves, each loading a particular shard of the data set into memory. The fault-tolerance support of taskgraph effectively turns this tree of in-memory shards into an RDD (resilient distributed dataset).
  2. The parameter server hosts the master copy of each BroadcastVariable and serves it to its children. Slaves at internal nodes of the tree also serve these read-only BroadcastVariables to their children.
  3. After a slave gets the broadcast variable, it does its computation on its shard of the data and stores the output into an accumulator. The slave also merges the accumulators of its direct children before sending the result back to its parent. Eventually the parameter server receives the aggregated output and uses it to decide what to do in the next epoch (a slave-side sketch follows after this list).
  4. We can implement the parameter server with different optimization algorithms, so that it can be used to optimize different ML models (a parameter-server sketch is also given below).
  5. The interface we need to expose to application developers at this level is extremely simple (an example implementation is sketched right after this list):

         type DatumHandler interface {
             Process(item Datum, param BroadcastVariable, gradient Accumulator)
         }

     It basically takes one data point and the read-only broadcast variable, and saves the result into the accumulator.
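
As an illustration of how an application developer might use this interface, here is a hypothetical DatumHandler that computes the gradient of squared loss for a linear model, using the BroadcastVariable and Accumulator interfaces sketched earlier. The Datum shape is an assumption made for this example only.

    // Datum is assumed here to be one training example with a dense feature
    // vector and a label; the real shape is up to the application.
    type Datum struct {
        Features []float64
        Label    float64
    }

    // squaredLossHandler computes, for one datum, the gradient of
    // 0.5 * (w.x - y)^2 with respect to the weights w.
    type squaredLossHandler struct{}

    func (squaredLossHandler) Process(item Datum, param BroadcastVariable, gradient Accumulator) {
        w := param.Weights()

        // prediction = w . x
        pred := 0.0
        for i, x := range item.Features {
            pred += w[i] * x
        }

        // d/dw of 0.5*(pred - label)^2 is (pred - label) * x
        residual := pred - item.Label
        grad := make([]float64, len(w))
        for i, x := range item.Features {
            grad[i] = residual * x
        }
        gradient.Add(grad)
    }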
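
Putting items 1-3 together, the work a slave does in one epoch might look roughly like the following. The communication helpers (recvBroadcast, recvChildAccumulators, sendToParent) and the newAccumulator constructor are hypothetical placeholders for the tree messaging that this thin layer would provide on top of taskgraph; they are not existing taskgraph calls.

    // runSlaveEpoch sketches one epoch on a slave: receive the broadcast from
    // the parent, process the local shard, merge the children's accumulators,
    // and send the merged result upward.
    func runSlaveEpoch(
        shard []Datum,
        handler DatumHandler,
        recvBroadcast func() BroadcastVariable,
        recvChildAccumulators func() []Accumulator,
        sendToParent func(Accumulator),
        newAccumulator func() Accumulator,
    ) {
        // 1. Get the read-only broadcast variable from the parent.
        param := recvBroadcast()

        // 2. Run the application's handler over the in-memory shard.
        local := newAccumulator()
        for _, item := range shard {
            handler.Process(item, param, local)
        }

        // 3. Internal nodes also fold in the accumulators of their direct children.
        for _, child := range recvChildAccumulators() {
            local.Merge(child)
        }

        // 4. Send the merged accumulator to the parent.
        sendToParent(local)
    }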
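
For item 4, the parameter server's side of an epoch could plug in any optimization algorithm; the sketch below assumes plain gradient descent with a fixed learning rate and a simple convergence check, both of which are assumptions for illustration. Again, the broadcast/receive helpers are hypothetical placeholders, not taskgraph API.

    // simpleBroadcast is a trivial BroadcastVariable backed by a weight slice.
    type simpleBroadcast struct{ w []float64 }

    func (b simpleBroadcast) Weights() []float64 { return b.w }

    // runParameterServerEpoch broadcasts the current weights, aggregates the
    // accumulators coming back from its direct children, applies one
    // gradient-descent step, and reports whether another epoch is needed.
    func runParameterServerEpoch(
        weights []float64,
        learningRate float64,
        broadcastToChildren func(BroadcastVariable),
        recvChildAccumulators func() []Accumulator,
        newAccumulator func() Accumulator,
    ) (next []float64, converged bool) {
        // 1. Push the current model down the tree.
        broadcastToChildren(simpleBroadcast{w: weights})

        // 2. Aggregate the children's accumulators.
        total := newAccumulator()
        for _, child := range recvChildAccumulators() {
            total.Merge(child)
        }

        // 3. One gradient-descent step; stop when the gradient is (nearly) zero.
        grad := total.Sum()
        next = make([]float64, len(weights))
        norm := 0.0
        for i := range weights {
            next[i] = weights[i] - learningRate*grad[i]
            norm += grad[i] * grad[i]
        }
        return next, norm < 1e-8
    }

Driving runParameterServerEpoch in a loop until it reports convergence gives the per-epoch control flow described in item 3.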