module_3/README.md: 20 additions & 1 deletion
@@ -33,6 +33,7 @@ This is a very similar module to module 1. The key difference is now we'll be us
 - [Step 7: Retrieve features + test stream ingestion](#step-7-retrieve-features--test-stream-ingestion)
   - [Overview](#overview)
   - [Time to run code!](#time-to-run-code)
+- [Step 8: Options for orchestrating streaming pipelines](#step-8-options-for-orchestrating-streaming-pipelines)
 - [Conclusion](#conclusion)
 - [Limitations](#limitations)
 - [Why Feast?](#why-feast)
@@ -251,6 +252,24 @@ Feast will help enforce a consistent schema across batch + streaming features as
 ### Time to run code!
 Now, run the [Jupyter notebook](feature_repo/module_3.ipynb)
+
+## Step 8: Options for orchestrating streaming pipelines
+We don't showcase this here, but broadly there are many approaches. In all of them, you'll likely want to generate operational metrics for monitoring (e.g. via StatsD or the Prometheus Pushgateway).
+
+To outline a few approaches:
+- **Option 1**: frequently run stream ingestion on a trigger, orchestrated by the tool of your choice (Airflow, Databricks Jobs, etc.); see the sketch after this list.
+- **Option 2**: with Databricks, use Databricks Jobs to monitor streaming queries and auto-retry on failure on a new cluster. See the [Databricks docs](https://docs.databricks.com/structured-streaming/query-recovery.html#configure-structured-streaming-jobs-to-restart-streaming-queries-on-failure) for details.
+- **Option 3**: with Dataproc, configure [restartable jobs](https://cloud.google.com/dataproc/docs/concepts/jobs/restartable-jobs).
+- **Option 4**: if you're using Flink, consider configuring a [restart strategy](https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/state/task_failure_recovery/).
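For Option 1, a minimal sketch of a triggered ingestion DAG, assuming Airflow 2.4+ and the `prometheus_client` library; `ingest_stream_batch()` and the Pushgateway address are hypothetical stand-ins for your actual setup:

```python
# Sketch only: Airflow triggers stream ingestion every 5 minutes and pushes
# an operational metric to a Prometheus Pushgateway.
from datetime import datetime, timedelta

from airflow.decorators import dag, task
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway


def ingest_stream_batch() -> int:
    # Hypothetical: consume pending events from the stream and write them to
    # the online store (e.g. via a trigger-once Spark Structured Streaming query).
    return 0  # rows ingested


@dag(schedule=timedelta(minutes=5), start_date=datetime(2023, 1, 1), catchup=False)
def stream_ingestion():
    @task
    def ingest_and_report() -> None:
        rows = ingest_stream_batch()
        # Operational metrics for monitoring, as suggested above.
        registry = CollectorRegistry()
        gauge = Gauge(
            "stream_rows_ingested", "Rows ingested in the last run", registry=registry
        )
        gauge.set(rows)
        push_to_gateway("pushgateway:9091", job="stream_ingestion", registry=registry)

    ingest_and_report()


stream_ingestion()
```

The same ingestion function could just as well be wrapped in a Databricks Job or Dataproc workflow; the pushed metric gives you a liveness signal to alert on.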
 # Conclusion
 By the end of this module, you will have learned how to build a full feature platform, with orchestrated batch transformations (using dbt + Airflow) and orchestrated materialization (with Feast + Airflow).
@@ -282,4 +301,4 @@ Several things change:
 - Production deployment of Airflow (e.g. syncing with a Git repository of DAGs, using k8s)
 - Bundling dbt models with Airflow (e.g. via S3 like this [MWAA + dbt guide](https://docs.aws.amazon.com/mwaa/latest/userguide/samples-dbt.html))
 - Airflow DAG parallelizes across feature views (instead of running a single `feature_store.materialize` across all feature views); see the sketch below
-- Feast materialization is configured to be more scalable (e.g. using other Feast batch materialization engines: [Bytewax](https://docs.feast.dev/reference/batch-materialization/bytewax), [Snowflake](https://docs.feast.dev/reference/batch-materialization/snowflake), [Lambda](https://docs.feast.dev/reference/batch-materialization/lambda), [Spark](https://docs.feast.dev/reference/batch-materialization/spark))
+- Feast materialization is configured to be more scalable (e.g. using other Feast batch materialization engines: [Bytewax](https://docs.feast.dev/reference/batch-materialization/bytewax), [Snowflake](https://docs.feast.dev/reference/batch-materialization/snowflake), [Lambda](https://docs.feast.dev/reference/batch-materialization/lambda), [Spark](https://docs.feast.dev/reference/batch-materialization/spark))
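A hedged sketch of the per-feature-view fan-out, assuming Airflow 2.3+ dynamic task mapping and a Feast repo reachable from the workers; `FEAST_REPO_PATH` and the DAG name are illustrative:

```python
# Sketch only: one Airflow task per feature view, mapped dynamically.
from datetime import datetime
from typing import List

from airflow.decorators import dag, task
from airflow.operators.python import get_current_context
from feast import FeatureStore

FEAST_REPO_PATH = "feature_repo"  # illustrative; point at your Feast repo


@dag(schedule="@daily", start_date=datetime(2023, 1, 1), catchup=False)
def parallel_materialization():
    @task
    def list_view_names() -> List[str]:
        store = FeatureStore(repo_path=FEAST_REPO_PATH)
        return [fv.name for fv in store.list_feature_views()]

    @task
    def materialize_view(view_name: str) -> None:
        ctx = get_current_context()
        store = FeatureStore(repo_path=FEAST_REPO_PATH)
        # Materialize only this view over the DAG run's data interval.
        store.materialize(
            start_date=ctx["data_interval_start"],
            end_date=ctx["data_interval_end"],
            feature_views=[view_name],
        )

    # expand() creates one task instance per feature view, so views
    # materialize in parallel across workers.
    materialize_view.expand(view_name=list_view_names())


parallel_materialization()
```

A failure is then isolated to a single feature view, and Airflow's pool and parallelism settings control how many views materialize concurrently.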