This is a generic Spark Streaming application designed to read from a data source, process the data, and write it to a sink. The application is built to be configurable and extensible, allowing for different readers, processors, and writers to be plugged in via configuration.
The project follows a standard Scala project layout:
```
src
├── main
│   ├── resources
│   │   └── config/application_delta2.conf   # Application configuration
│   └── scala
│       ├── Main.scala        # Main entry point for the application
│       ├── domain/           # Case classes for data models (e.g., RawUserEvent, UserEngagementResult)
│       ├── io/               # Data readers (sources) and writers (sinks)
│       ├── usecase/          # Business logic and stream processing
│       └── readSample.scala  # Sample Scala file
└── test
    └── scala
```
`Main.scala` is the entry point of the application. It initializes Spark, reads the configuration, and wires together the reader, processor, and writer.

- `DataReader`: responsible for reading data from a source (e.g., Kafka, Kinesis, files).
- `StreamProcessor`: contains the core business logic for transforming the input data stream.
- `DataWriter`: responsible for writing the processed data to a sink (e.g., Delta Lake, console).
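The wiring in `Main.scala` presumably loads each component from its configured class name via reflection. Here is a minimal, self-contained sketch of that pattern; the `Component` and `SampleReader` names are hypothetical stand-ins, not this project's actual API:

```scala
// Hypothetical sketch: instantiate a component from a fully qualified
// class name, the way Main could resolve the configured
// reader/processor/writer classes.
trait Component { def describe: String }

class SampleReader extends Component {
  def describe: String = "reads events"
}

object WiringSketch {
  // Reflectively construct the class named in configuration and cast it
  // to the expected trait; requires a public no-argument constructor.
  def instantiate[T](className: String): T =
    Class.forName(className).getDeclaredConstructor().newInstance().asInstanceOf[T]

  def main(args: Array[String]): Unit = {
    val reader = instantiate[Component]("SampleReader")
    println(reader.describe)
  }
}
```

In the real application the class name would come from the configuration (e.g., a `reader.class` key; the key name here is illustrative).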
The application is configured using HOCON (Human-Optimized Config Object Notation) format. The main configuration file is located at src/main/resources/config/application_delta2.conf.
This file defines:
- The job name.
- The fully qualified class names of the `DataReader`, `StreamProcessor`, and `DataWriter` implementations to be used.
- Specific configurations for the reader, processor, and writer components.
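For reference, such a configuration might look like the following. The key names (`options.path`, `settings.windowDurationSeconds`, `checkpointLocation`, `outputMode`) match those read by the example components in this README; the values are illustrative:

```hocon
job {
  name = "UserEngagementStream"

  reader {
    class = "io.source.JsonReader"
    options { path = "/data/input/events" }          // illustrative path
  }

  processor {
    class = "usecase.UserEngagementProcessor"
    settings { windowDurationSeconds = 60 }
  }

  writer {
    class = "io.sink.DeltaWriter"
    outputMode = "append"
    checkpointLocation = "/data/checkpoints/engagement"  // illustrative path
    options { path = "/data/output/engagement" }
  }
}
```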
- Build the project:

  ```shell
  mvn clean package
  ```

- Submit the Spark job: use `spark-submit` to run the application. You will need to provide the application JAR and any other dependencies.

  ```shell
  spark-submit --class Main \
    --master local[*] \
    target/your-project-jar.jar
  ```
To add a new data source, processing step, or sink:
- Implement the corresponding `DataReader`, `StreamProcessor`, or `DataWriter` trait.
- Update the configuration file (e.g., `application_delta2.conf`) to point to your new implementation.
This project is designed around a set of core abstractions (DataReader, DataWriter, StreamProcessor) that allow for easy extension and customization. Below is a guide on how to implement each of these components.
The DataReader trait is responsible for reading data from an external source and returning it as a Dataset[T].
Trait Definition:

```scala
package io

import com.typesafe.config.Config
import org.apache.spark.sql.{Dataset, SparkSession}

import scala.reflect.runtime.universe.TypeTag

trait DataReader[T] {
  def read(config: Config)(implicit spark: SparkSession, tt: TypeTag[T]): Dataset[T]
}
```

Implementation Example: JsonReader
The JsonReader implementation reads data from a directory of JSON files.
```scala
package io.source

import com.typesafe.config.Config
import domain.RawUserEvent
import io.DataReader
import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

import scala.reflect.runtime.universe.TypeTag

class JsonReader extends DataReader[RawUserEvent] {
  override def read(config: Config)(implicit spark: SparkSession, tt: TypeTag[RawUserEvent]): Dataset[RawUserEvent] = {
    import spark.implicits._
    val inputPath = config.getString("options.path")
    spark.readStream
      .schema(Encoders.product[RawUserEvent].schema)  // derive the schema from the case class
      .format("json")
      .load(inputPath)
      .as[RawUserEvent]
  }
}
```

To implement your own DataReader:
- Create a class that extends `DataReader[YourCaseClass]`.
- Implement the `read` method to connect to your data source (e.g., Kafka, Kinesis) and transform the data into a `Dataset` of your case class.
- Use the provided `Config` object to fetch any necessary configurations, such as file paths, server addresses, or topic names.
The StreamProcessor trait is where the core business logic of your streaming application resides. It takes an input Dataset[IN] and transforms it into an output Dataset[OUT].
Trait Definition:

```scala
package usecase.contract

import com.typesafe.config.Config
import org.apache.spark.sql.{Dataset, SparkSession}

trait StreamProcessor[IN, OUT] {
  def process(input: Dataset[IN], config: Config)(implicit spark: SparkSession): Dataset[OUT]
}
```

Implementation Example: UserEngagementProcessor
This processor calculates user engagement metrics (e.g., click and purchase counts) per user over a fixed-size event-time window.
```scala
package usecase

import com.typesafe.config.Config
import domain.{RawUserEvent, UserEngagementResult}
import org.apache.spark.sql.functions.window
import org.apache.spark.sql.{Dataset, SparkSession}
import usecase.contract.StreamProcessor

class UserEngagementProcessor extends StreamProcessor[RawUserEvent, UserEngagementResult] {
  override def process(input: Dataset[RawUserEvent], config: Config)(implicit spark: SparkSession): Dataset[UserEngagementResult] = {
    import spark.implicits._
    val windowDuration = s"${config.getInt("settings.windowDurationSeconds")} seconds"
    input
      .withWatermark("eventTimestamp", "10 seconds")  // tolerate events arriving up to 10s late
      .groupBy(window($"eventTimestamp", windowDuration), $"userId")
      .agg(
        // ... aggregation logic ...
      )
      .as[UserEngagementResult]
  }
}
```

To implement your own StreamProcessor:
- Define your input and output case classes (e.g., `MyInput`, `MyOutput`).
- Create a class that extends `StreamProcessor[MyInput, MyOutput]`.
- Implement the `process` method to apply your business logic using Spark's Dataset API.
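To make the elided aggregation logic concrete, here is a plain-collections sketch (no Spark; `Event` and `Engagement` are illustrative stand-ins for the project's case classes) of counting clicks and purchases per user in fixed windows:

```scala
// Illustrative, non-Spark sketch of windowed engagement aggregation:
// bucket events into fixed-size windows and count clicks/purchases per user.
case class Event(userId: String, eventType: String, timestampSec: Long)
case class Engagement(userId: String, windowStart: Long, clicks: Int, purchases: Int)

object EngagementSketch {
  def aggregate(events: Seq[Event], windowSec: Long): Seq[Engagement] =
    events
      // Key each event by (user, start of its window).
      .groupBy(e => (e.userId, (e.timestampSec / windowSec) * windowSec))
      .map { case ((user, start), evs) =>
        Engagement(
          user,
          start,
          clicks = evs.count(_.eventType == "click"),
          purchases = evs.count(_.eventType == "purchase"))
      }
      .toSeq
      .sortBy(e => (e.userId, e.windowStart))
}
```

In the real processor this grouping and counting is expressed with `groupBy(window(...), $"userId").agg(...)` so Spark can run it incrementally over the stream.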
The DataWriter trait is responsible for writing a Dataset[T] to a sink.
Trait Definition:

```scala
package io

import com.typesafe.config.Config
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.DataStreamWriter

trait DataWriter[T] {
  def write(data: Dataset[T], config: Config): DataStreamWriter[T]
}
```

Implementation Example: DeltaWriter
The DeltaWriter writes the stream to a Delta Lake table.
```scala
package io.sink

import com.typesafe.config.Config
import io.DataWriter
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.DataStreamWriter

class DeltaWriter[T] extends DataWriter[T] {
  override def write(data: Dataset[T], config: Config): DataStreamWriter[T] = {
    val outputPath = config.getString("options.path")
    val checkpointLocation = config.getString("checkpointLocation")
    data.writeStream
      .format("delta")
      .outputMode(config.getString("outputMode"))
      .option("checkpointLocation", checkpointLocation)
      .option("path", outputPath)
  }
}
```

To implement your own DataWriter:
- Create a class that extends `DataWriter[YourCaseClass]`.
- Implement the `write` method to configure the `DataStreamWriter` for your desired sink (e.g., console, Kafka, another database).
- Use the `Config` object to get parameters like output paths, checkpoint locations, and output modes.
Here is a quick template to follow when adding a new end-to-end streaming pipeline.
- Define Data Models:
  - Create your input and output case classes in the `domain` package (e.g., `MyInput.scala`, `MyOutput.scala`).
- Implement `DataReader`:
  - Create a new class under `io.source` that extends `DataReader[MyInput]`.
  - Implement the `read` method to ingest data from your source.
- Implement `StreamProcessor`:
  - Create a new class under `usecase` that extends `StreamProcessor[MyInput, MyOutput]`.
  - Implement the `process` method to define your business logic.
- Implement `DataWriter`:
  - Create a new class under `io.sink` that extends `DataWriter[MyOutput]`.
  - Implement the `write` method to send data to your sink.
- Update Configuration:
  - In your configuration file (e.g., `application_delta2.conf`), update the class paths to point to your new implementations and provide any necessary options:

    ```hocon
    job {
      name = "MyNewStream"

      reader {
        class = "io.source.MyNewReader"
        // ... reader-specific options
      }

      processor {
        class = "usecase.MyNewProcessor"
        // ... processor-specific options
      }

      writer {
        class = "io.sink.MyNewWriter"
        // ... writer-specific options
      }
    }
    ```