module_3/README.md: 20 additions & 1 deletion
@@ -33,6 +33,7 @@ This is a very similar module to module 1. The key difference is now we'll be us
 - [Step 7: Retrieve features + test stream ingestion](#step-7-retrieve-features--test-stream-ingestion)
   - [Overview](#overview)
   - [Time to run code!](#time-to-run-code)
+- [Step 8: Options for orchestrating streaming pipelines](#step-8-options-for-orchestrating-streaming-pipelines)
 - [Conclusion](#conclusion)
 - [Limitations](#limitations)
 - [Why Feast?](#why-feast)
@@ -251,6 +252,24 @@ Feast will help enforce a consistent schema across batch + streaming features as
 ### Time to run code!
 Now, run the [Jupyter notebook](feature_repo/module_3.ipynb)
+
+## Step 8: Options for orchestrating streaming pipelines
+We don't showcase this here, but broadly there are many approaches. In all of them, you'll likely want to generate operational metrics for monitoring (e.g. via StatsD or the Prometheus Pushgateway).
+
+To outline a few approaches:
+- **Option 1**: frequently run stream ingestion on a trigger, orchestrated by the tool of your choice (Airflow, Databricks Jobs, etc.); see the sketch after this list.
+- **Option 2**: with Databricks, use Databricks Jobs to monitor streaming queries and auto-retry on failure on a new cluster. See the [Databricks docs](https://docs.databricks.com/structured-streaming/query-recovery.html#configure-structured-streaming-jobs-to-restart-streaming-queries-on-failure) for details.
+- **Option 3**: with Dataproc, configure [restartable jobs](https://cloud.google.com/dataproc/docs/concepts/jobs/restartable-jobs).
+- **Option 4**: if you're using Flink, consider configuring a [restart strategy](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/task_failure_recovery/).
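For Option 1, a minimal sketch of a triggered ingestion DAG, assuming Airflow 2.4+ and the `prometheus_client` library; `ingest_stream_batch()` and the Pushgateway address are hypothetical stand-ins for your actual setup:

```python
# Sketch only: Airflow triggers stream ingestion every 5 minutes and pushes
# an operational metric to a Prometheus Pushgateway.
from datetime import datetime, timedelta

from airflow.decorators import dag, task
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def ingest_stream_batch() -> int:
    # Hypothetical: consume pending events from the stream and write them to
    # the online store (e.g. via a trigger-once Spark Structured Streaming query).
    return 0  # rows ingested


@dag(schedule=timedelta(minutes=5), start_date=datetime(2023, 1, 1), catchup=False)
def stream_ingestion():
    @task
    def ingest_and_report() -> None:
        rows = ingest_stream_batch()
        # Operational metrics for monitoring, as suggested above.
        registry = CollectorRegistry()
        gauge = Gauge(
            "stream_rows_ingested", "Rows ingested in the last run", registry=registry
        )
        gauge.set(rows)
        push_to_gateway("pushgateway:9091", job="stream_ingestion", registry=registry)

    ingest_and_report()


stream_ingestion()
```

The same ingestion function could just as well be wrapped in a Databricks Job or Dataproc workflow; the pushed metric gives you a liveness signal to alert on.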
 # Conclusion
 By the end of this module, you will have learned how to build a full feature platform, with orchestrated batch transformations (using dbt + Airflow) and orchestrated materialization (with Feast + Airflow).
@@ -282,4 +301,4 @@ Several things change:
 - Production deployment of Airflow (e.g. syncing with a Git repository of DAGs, using k8s)
 - Bundling dbt models with Airflow (e.g. via S3 like this [MWAA + dbt guide](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-dbt.html))
 - Airflow DAG parallelizes across feature views (instead of running a single `feature_store.materialize` across all feature views); see the sketch below
-- Feast materialization is configured to be more scalable (e.g. using other Feast batch materialization engines: [Bytewax](https://docs.feast.dev/reference/batch-materialization/bytewax), [Snowflake](https://docs.feast.dev/reference/batch-materialization/snowflake), [Lambda](https://docs.feast.dev/reference/batch-materialization/lambda), [Spark](https://docs.feast.dev/reference/batch-materialization/spark))
+- Feast materialization is configured to be more scalable (e.g. using other Feast batch materialization engines: [Bytewax](https://docs.feast.dev/reference/batch-materialization/bytewax), [Snowflake](https://docs.feast.dev/reference/batch-materialization/snowflake), [Lambda](https://docs.feast.dev/reference/batch-materialization/lambda), [Spark](https://docs.feast.dev/reference/batch-materialization/spark))
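A hedged sketch of the per-feature-view fan-out, assuming Airflow 2.3+ dynamic task mapping and a Feast repo reachable from the workers; `FEAST_REPO_PATH` and the DAG name are illustrative:

```python
# Sketch only: one Airflow task per feature view, mapped dynamically.
from datetime import datetime
from typing import List

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from feast import FeatureStore

FEAST_REPO_PATH = "feature_repo"  # illustrative; point at your Feast repo


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def parallel_materialization():
    @task
    def list_view_names() -> List[str]:
        store = FeatureStore(repo_path=FEAST_REPO_PATH)
        return [fv.name for fv in store.list_feature_views()]

    @task
    def materialize_view(view_name: str) -> None:
        ctx = get_current_context()
        store = FeatureStore(repo_path=FEAST_REPO_PATH)
        # Materialize only this view over the DAG run's data interval.
        store.materialize(
            start_date=ctx["data_interval_start"],
            end_date=ctx["data_interval_end"],
            feature_views=[view_name],
        )

    # expand() creates one task instance per feature view, so views
    # materialize in parallel across workers.
    materialize_view.expand(view_name=list_view_names())


parallel_materialization()
```

A failure is then isolated to a single feature view, and Airflow's pool and parallelism settings control how many views materialize concurrently.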