@@ -109,14 +109,40 @@ def materialize(
109109 :param lazy:
110110 Whether this task is lazy or not.
111111
112- Unlike a normal task, lazy tasks always get executed. However, if a lazy
113- task produces a lazy table (e.g. a SQL query), the table store checks if
114- the same query has been executed before. If this is the case, then the
115- query doesn't get executed, and instead, the table gets copied from the cache.
112+ Unlike a normal task, lazy tasks always get executed. However, before the table
113+ returned by a lazy task gets materialized, the table store checks whether
114+ the same table has been materialized before. If so, the table is not
115+ materialized again; instead, it gets copied from the cache.
116+
117+ This is efficient for tasks that return SQL queries, because the query
118+ only gets generated but will not be executed again if the resulting table is cache-valid.
119+
120+ The same also works for :py:class:`ExternalTableReference <pydiverse.pipedag.container.ExternalTableReference>`,
121+ where the "query" is just the identifier of the table in the store.
122+
123+ .. Note:: For tasks returning an ``ExternalTableReference``, pipedag cannot automatically
124+ know whether the external table has changed or not. This should be controlled via a cache function
125+ given via the ``cache`` argument of ``materialize``.
126+ See :py:class:`ExternalTableReference <pydiverse.pipedag.container.ExternalTableReference>`
127+ for an example.
128+
129+
130+ For tasks returning a Polars DataFrame, the output is deemed cache-valid
131+ if the hash of the resulting DataFrame is the same as the hash from the previous run.
132+ So, even though the task always gets executed, downstream tasks can remain cache-valid
133+ if the DataFrame is the same as before. This is useful for small tasks that are hard to
134+ implement using only LazyFrames, but where the DataFrame generation is cheap.
135+
136+
137+
138+ In both cases, you don't need to manually bump the ``version`` of a lazy task.
139+
140+ .. Warning:: A task returning a Polars LazyFrame should *not* be marked as lazy.
141+ Use ``version=AUTO_VERSION`` instead. See :py:class:`AUTO_VERSION`.
142+ .. Warning:: A task returning a Pandas DataFrame should *not* be marked as lazy.
143+ No hashing is implemented for Pandas DataFrames, so the task will always
144+ be deemed cache-invalid, and thus, cache-invalidate all downstream tasks.
116145
117- This behaviour is very useful, because you don't need to manually bump
118- the `version` of a lazy task. This only works because for lazy tables
119- generating the query is very cheap compared to executing it.
120146 :param group_node_tag:
121147 Set a tag that may add this task to a configuration based group node.
122148 :param nout:
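
The lazy-task behaviour documented in this hunk can be illustrated with a toy model. This is only a sketch under stated assumptions, not pipedag's implementation: `ToyTableStore`, `fake_database` and `lazy_task` are hypothetical names, the cache key is assumed to be a SHA-256 hash of the query text, and the cached result is returned directly rather than copied between schemas.

```python
import hashlib


class ToyTableStore:
    """Hypothetical stand-in for a table store (not pipedag's API)."""

    def __init__(self):
        self.cache = {}      # query hash -> previously materialized rows
        self.executions = 0  # how often a query was actually executed

    def materialize(self, query, run_query):
        # Assumption: cache validity is decided by hashing the query text.
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self.cache:
            # Same query seen before: reuse the cached result, skip execution.
            return self.cache[key]
        self.executions += 1
        rows = run_query(query)
        self.cache[key] = rows
        return rows


def lazy_task():
    # The task body always runs, but it only builds the query (cheap).
    return "SELECT 1 AS x"


def fake_database(query):
    # Stands in for the expensive step of executing the query.
    return [{"x": 1}]


store = ToyTableStore()
first = store.materialize(lazy_task(), fake_database)
second = store.materialize(lazy_task(), fake_database)
```

In this sketch `lazy_task` runs twice while `fake_database` runs only once, which mirrors why generating the query must be cheap relative to executing it.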