In contrast to traditional data processing systems that provide one dedicated execution engine, Apache Wayang (incubating) can transparently and seamlessly integrate multiple execution engines and use them to perform a single task. We call this cross-platform data processing. In Wayang, users can specify any data processing application using one of Wayang's APIs and then Wayang will choose the data processing platform(s), e.g., Postgres or Apache Spark, that best fits the application. Finally, Wayang will perform the execution, thereby hiding the different platform-specific APIs and coordinating inter-platform communication.
Apache Wayang (incubating) aims at freeing data engineers and software developers from the burden of learning all the different data processing systems, their APIs, strengths and weaknesses; the intricacies of coordinating and integrating different processing platforms; and the inflexibility of being bound to a fixed set of processing platforms. As of now, Wayang has built-in support for several processing platforms, including Java Streams, Apache Spark, GraphChi, SQLite3, and Postgres (one platform adapter module each; see the module list below).
Apache Wayang (incubating) can be used via the following APIs:
- Java native
- Java scala-like
- Scala
- SQL (limited support of simple select-project queries for now)
For a quick guide on how to run WordCount see here.
For a quick guide on how to use Wayang in your Java/Scala project see here.
You first have to build the binaries as shown here. Once you have the binaries built, follow these steps to install Wayang:
tar -xvf wayang-0.6.1-SNAPSHOT.tar.gz
cd wayang-0.6.1-SNAPSHOT
On Linux:
echo "export WAYANG_HOME=$(pwd)" >> ~/.bashrc
echo "export PATH=${PATH}:${WAYANG_HOME}/bin" >> ~/.bashrc
source ~/.bashrc
On macOS:
echo "export WAYANG_HOME=$(pwd)" >> ~/.zshrc
echo "export PATH=${PATH}:${WAYANG_HOME}/bin" >> ~/.zshrc
source ~/.zshrc
Since Apache Wayang (incubating) is not an execution engine itself but rather manages the execution engines for you, it is important to have the necessary requirements installed.
- Apache Wayang supports Java versions 8 and above. However, the Wayang team recommends using Java version 11. Don't forget to set the JAVA_HOME environment variable.
- You need to install Apache Spark version 3 or higher. Don't forget to set the SPARK_HOME environment variable.
- You need to install Apache Hadoop version 3 or higher. Don't forget to set the HADOOP_HOME environment variable.
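For reference, a minimal sketch of how these variables might be set in your shell profile (the installation paths below are placeholders; adjust them to your system):
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk   # placeholder path
export SPARK_HOME=/opt/spark                    # placeholder path
export HADOOP_HOME=/opt/hadoop                  # placeholder path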
To execute your first application with Apache Wayang, run your program with the wayang-submit command:
bin/wayang-submit org.apache.wayang.apps.wordcount.Main java file://$(pwd)/README.md
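The second-to-last argument selects the execution platform. Assuming the example app accepts any supported platform name in that position (an assumption based on the java argument above, not verified here), the same WordCount could, for instance, be dispatched to Spark:
bin/wayang-submit org.apache.wayang.apps.wordcount.Main spark file://$(pwd)/README.md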
Wayang is available via Maven Central. To use it with Maven, include the following code snippet into your POM file:
<dependency>
<groupId>org.apache.wayang</groupId>
<artifactId>wayang-***</artifactId>
<version>0.6.0</version>
</dependency>
Note the ***: Wayang ships with multiple modules that can be included in your app, depending on how you want to use it:
- wayang-core: provides core data structures and the optimizer (required)
- wayang-basic: provides common operators and data types for your apps (recommended)
- wayang-api-scala-java_2.12: provides an easy-to-use Scala and Java API to assemble Wayang plans (recommended)
- wayang-java, wayang-spark, wayang-graphchi, wayang-sqlite3, wayang-postgres: adapters for the various supported processing platforms
- wayang-profiler: provides functionality to learn operator and UDF cost functions from historical execution data
NOTE: The module wayang-api-scala-java_2.12 is intended to be used with Java 11 and Scala 2.12. If you are on Java 8, you need to use the wayang-api-scala-java_2.11 module instead.
For the sake of version flexibility, you still have to include in the POM file your Hadoop (hadoop-hdfs and hadoop-common) and Spark (spark-core and spark-graphx) version of choice.
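For illustration, a combined set of dependencies might look as follows. This is only a sketch: the Wayang version is taken from the snippet above, the Spark artifacts carry the _2.12 suffix matching the Scala 2.12 build, and the Hadoop/Spark version numbers are placeholders to be replaced by your versions of choice.
<dependencies>
  <!-- Wayang modules (version from the snippet above) -->
  <dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-core</artifactId>
    <version>0.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-basic</artifactId>
    <version>0.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-api-scala-java_2.12</artifactId>
    <version>0.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-java</artifactId>
    <version>0.6.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.wayang</groupId>
    <artifactId>wayang-spark</artifactId>
    <version>0.6.0</version>
  </dependency>
  <!-- Hadoop and Spark dependencies of your choice (placeholder versions) -->
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>3.3.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>3.3.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>3.1.2</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-graphx_2.12</artifactId>
    <version>3.1.2</version>
  </dependency>
</dependencies>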
In addition, you can obtain the most recent snapshot version of Wayang via Sonatype's snapshot repository. Just include:
<repositories>
<repository>
<id>apache-snapshots</id>
<name>Apache Foundation Snapshot Repository</name>
<url>https://repository.apache.org/content/repositories/snapshots</url>
</repository>
</repositories>
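With that repository in place, a snapshot version can be referenced like any other dependency version; for example (the snapshot version shown here matches the distribution used earlier in this document and may differ for newer snapshots):
<dependency>
  <groupId>org.apache.wayang</groupId>
  <artifactId>wayang-core</artifactId>
  <version>0.6.1-SNAPSHOT</version>
</dependency>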
Apache Wayang (incubating) is built with Java 11 and Scala 2.12. However, to run Apache Wayang it is sufficient to have just Java 11 installed. Please also consider that processing platforms employed by Wayang might have further requirements.
- Java 11
- Scala 2.12 (optional)
NOTE: Wayang also works with Java 8 and Scala 2.11. If you want to use these versions, you will have to re-build Wayang (see below).
NOTE: On Windows, you need to define the HADOOP_HOME environment variable pointing to a directory that contains winutils.exe. One (unofficial) option is to download a prebuilt winutils.exe, or you can build your own winutils.exe following the instructions in the corresponding repository. You may also need to install msvcr100.dll.
NOTE: Make sure that the JAVA_HOME environment variable is set to either Java 8 or Java 11, as the prerequisite checker script currently supports up to Java 11 and checks the newest Java version if you have a higher one installed. On Linux, it is preferable to export JAVA_HOME inside the project folder. It is also recommended to run './mvnw clean install' before opening the project in IntelliJ.
If you need to rebuild Wayang, e.g., to use a different Scala version, you can simply do so via Maven:
- Adapt the version variables (e.g., spark.version) in the main pom.xml file.
- Build Wayang with the adapted versions:
git clone https://github.com/apache/incubator-wayang.git
cd incubator-wayang
./mvnw clean install -DskipTests
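As a sketch of the first step, the version variables are ordinary Maven properties in the main pom.xml; adapting them might look like this (the property name spark.version comes from the example above, the value is a placeholder, and the exact properties block in Wayang's pom.xml may differ):
<properties>
  <!-- Pin the Spark version used by the build (placeholder value) -->
  <spark.version>3.1.2</spark.version>
</properties>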
NOTE: If you receive an error about not finding MathExBaseVisitor, then the problem might be that you are trying to build from IntelliJ without Maven. MathExBaseVisitor is generated code, and a Maven build should generate it automatically.
NOTE: In the current Maven setup, the Scala version is tied to the Java version: the scala-11 profile compiles with Java 8 and the scala-12 profile with Java 11.
NOTE: For compiling and testing the code it is required to have Hadoop installed on your machine.
NOTE: The standalone profile fixes the Hadoop and Spark versions, so that Wayang apps do not explicitly need to declare the corresponding dependencies. Also note the distro profile, which assembles a binary Wayang distribution. To activate these profiles, you need to specify them when running Maven, i.e.,
./mvnw clean install -DskipTests -P<profile name>
In the incubator-wayang root folder run:
./mvnw test
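If you only want to run the tests of a single module, standard Maven module selection works with the wrapper as well; for instance (module name taken from the list above):
./mvnw test -pl :wayang-core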
The "Hello World!" of data processing systems is the wordcount.
import org.apache.wayang.api.JavaPlanBuilder;
import org.apache.wayang.basic.data.Tuple2;
import org.apache.wayang.core.api.Configuration;
import org.apache.wayang.core.api.WayangContext;
import org.apache.wayang.core.optimizer.cardinality.DefaultCardinalityEstimator;
import org.apache.wayang.java.Java;
import org.apache.wayang.spark.Spark;
import java.util.Collection;
import java.util.Arrays;
public class WordcountJava {
public static void main(String[] args){
// Settings
String inputUrl = "file:/tmp.txt";
// Get a plan builder.
WayangContext wayangContext = new WayangContext(new Configuration())
.withPlugin(Java.basicPlugin())
.withPlugin(Spark.basicPlugin());
JavaPlanBuilder planBuilder = new JavaPlanBuilder(wayangContext)
.withJobName(String.format("WordCount (%s)", inputUrl))
.withUdfJarOf(WordcountJava.class);
// Start building the WayangPlan.
Collection<Tuple2<String, Integer>> wordcounts = planBuilder
// Read the text file.
.readTextFile(inputUrl).withName("Load file")
// Split each line by non-word characters.
.flatMap(line -> Arrays.asList(line.split("\\W+")))
.withSelectivity(10, 100, 0.9)
.withName("Split words")
// Filter empty tokens.
.filter(token -> !token.isEmpty())
.withSelectivity(0.99, 0.99, 0.99)
.withName("Filter empty words")
// Attach counter to each word.
.map(word -> new Tuple2<>(word.toLowerCase(), 1)).withName("To lower case, add counter")
// Sum up counters for every word.
.reduceByKey(
Tuple2::getField0,
(t1, t2) -> new Tuple2<>(t1.getField0(), t1.getField1() + t2.getField1())
)
.withCardinalityEstimator(new DefaultCardinalityEstimator(0.9, 1, false, in -> Math.round(0.01 * in[0])))
.withName("Add counters")
// Execute the plan and collect the results.
.collect();
System.out.println(wordcounts);
}
}
import org.apache.wayang.api._
import org.apache.wayang.core.api.{Configuration, WayangContext}
import org.apache.wayang.java.Java
import org.apache.wayang.spark.Spark
object WordcountScala {
def main(args: Array[String]) {
// Settings
val inputUrl = "file:/tmp.txt"
// Get a plan builder.
val wayangContext = new WayangContext(new Configuration)
.withPlugin(Java.basicPlugin)
.withPlugin(Spark.basicPlugin)
val planBuilder = new PlanBuilder(wayangContext)
.withJobName(s"WordCount ($inputUrl)")
.withUdfJarsOf(this.getClass)
val wordcounts = planBuilder
// Read the text file.
.readTextFile(inputUrl).withName("Load file")
// Split each line by non-word characters.
.flatMap(_.split("\\W+"), selectivity = 10).withName("Split words")
// Filter empty tokens.
.filter(_.nonEmpty, selectivity = 0.99).withName("Filter empty words")
// Attach counter to each word.
.map(word => (word.toLowerCase, 1)).withName("To lower case, add counter")
// Sum up counters for every word.
.reduceByKey(_._1, (c1, c2) => (c1._1, c1._2 + c2._2)).withName("Add counters")
.withCardinalityEstimator((in: Long) => math.round(in * 0.01))
// Execute the plan and collect the results.
.collect()
println(wordcounts)
}
}
Wayang is also capable of iterative processing, which is, e.g., very important for machine learning algorithms, such as k-means.
import org.apache.wayang.api._
import org.apache.wayang.core.api.{Configuration, WayangContext}
import org.apache.wayang.core.function.FunctionDescriptor.ExtendedSerializableFunction
import org.apache.wayang.core.function.ExecutionContext
import org.apache.wayang.core.optimizer.costs.LoadProfileEstimators
import org.apache.wayang.java.Java
import org.apache.wayang.spark.Spark
import scala.util.Random
import scala.collection.JavaConversions._
object kmeans {
def main(args: Array[String]) {
// Settings
val inputUrl = "file:/kmeans.txt"
val k = 5
val iterations = 100
val configuration = new Configuration
// Get a plan builder.
val wayangContext = new WayangContext(new Configuration)
.withPlugin(Java.basicPlugin)
.withPlugin(Spark.basicPlugin)
val planBuilder = new PlanBuilder(wayangContext)
.withJobName(s"k-means ($inputUrl, k=$k, $iterations iterations)")
.withUdfJarsOf(this.getClass)
case class Point(x: Double, y: Double)
case class TaggedPoint(x: Double, y: Double, cluster: Int)
case class TaggedPointCounter(x: Double, y: Double, cluster: Int, count: Long) {
def add_points(that: TaggedPointCounter) = TaggedPointCounter(this.x + that.x, this.y + that.y, this.cluster, this.count + that.count)
def average = TaggedPointCounter(x / count, y / count, cluster, 0)
}
// Read and parse the input file(s).
val points = planBuilder
.readTextFile(inputUrl).withName("Read file")
.map { line =>
val fields = line.split(",")
Point(fields(0).toDouble, fields(1).toDouble)
}.withName("Create points")
// Create initial centroids.
val random = new Random
val initialCentroids = planBuilder
.loadCollection(for (i <- 1 to k) yield TaggedPointCounter(random.nextGaussian(), random.nextGaussian(), i, 0)).withName("Load random centroids")
// Declare UDF to select centroid for each data point.
class SelectNearestCentroid extends ExtendedSerializableFunction[Point, TaggedPointCounter] {
/** Keeps the broadcasted centroids. */
var centroids: Iterable[TaggedPointCounter] = _
override def open(executionCtx: ExecutionContext) = {
centroids = executionCtx.getBroadcast[TaggedPointCounter]("centroids")
}
override def apply(point: Point): TaggedPointCounter = {
var minDistance = Double.PositiveInfinity
var nearestCentroidId = -1
for (centroid <- centroids) {
val distance = Math.pow(Math.pow(point.x - centroid.x, 2) + Math.pow(point.y - centroid.y, 2), 0.5)
if (distance < minDistance) {
minDistance = distance
nearestCentroidId = centroid.cluster
}
}
new TaggedPointCounter(point.x, point.y, nearestCentroidId, 1)
}
}
// Do the k-means loop.
val finalCentroids = initialCentroids.repeat(iterations, { currentCentroids =>
points
.mapJava(new SelectNearestCentroid,
udfLoad = LoadProfileEstimators.createFromSpecification(
"my.udf.costfunction.key", configuration
))
.withBroadcast(currentCentroids, "centroids").withName("Find nearest centroid")
.reduceByKey(_.cluster, _.add_points(_)).withName("Add up points")
.withCardinalityEstimator(k)
.map(_.average).withName("Average points")
}).withName("Loop")
// Collect the results.
.collect()
println(finalCentroids)
}
}
As a contributor, you can help shape the future of the project by providing feedback, joining our mailing lists, reporting bugs, requesting features, and participating in discussions. As you become more involved, you can also help with development by providing patches for bug fixes or features and helping to improve our documentation.
If you show sustained commitment to the project, you may be invited to become a committer. This brings with it the privilege of write access to the project repository and resources.
To learn more about how to get involved with the Apache Wayang project, please visit our “Get Involved” page and read the Apache code of conduct. We look forward to your contributions!
The list of contributors.
All files in this repository are licensed under the Apache Software License 2.0
Copyright 2020 - 2023 The Apache Software Foundation.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
The Logo was donated by Brian Vera.