@@ -323,14 +323,19 @@ are `sqlalchemy.Table`, `pandas.DataFrame`, `polars.DataFrame`, or `polars.LazyF
 
 ### Controlling automatic cache invalidation
 
-For input_type `sa.Table`, and `pdt.SqlAlchemy`, in general, it is best to set lazy=True. This means the task is always
+For input_type `sa.Table`, and `pdt.SqlAlchemy`, in general, it is best to set `lazy=True`. This means the task is always
 executed because producing a query is fast, but the query is only executed when it is actually needed. For
 `pl.LazyFrame`, `version=AUTO_VERSION` is a good choice, because then the task is executed once with empty input
-dataframes and only if resulting LazyFrame expressions change, the task is executed again with full input data. For
-`pd.DataFrame` and `pl.DataFrame`, we don't try to guess which changes of the code are actually meaningful. Thus the
-user needs to help manually bumpig a version number like `version="1.0.0"`. For development, `version=None` simply
-deactivates caching until the code is more stable. It is recommended to always develop with small pipeline instances
-anyways to achieve high iteration speed (see [multi_instance_pipeline.md](multi_instance_pipeline.md)).
+dataframes and only if resulting LazyFrame expressions change, the task is executed again with full input data.
+
+For `pd.DataFrame` and `pl.DataFrame`, we don't try to guess which changes of the code are actually meaningful. Thus,
+to avoid running the task, the user needs to help by manually bumping a version number like `version="1.0.0"`.
+For development, `version=None` simply deactivates caching until the code is more stable. It is recommended to always
+develop with small pipeline instances anyway to achieve high iteration speed (see [multi_instance_pipeline.md](multi_instance_pipeline.md)).
+Setting `lazy=True` for tasks returning `pd.DataFrame` or `pl.DataFrame` objects always executes the task, but hashes the result to
+determine the cache validity of the task output and hence the cache invalidation of downstream tasks.
+This is a good choice for tasks returning small dataframes which are quick to compute and where bumping the version number adds unwanted
+complexity to the development process. It is allowed to produce both dataframe and SQL output in one `@materialize(lazy=True, ...)` task.
 
 ### Integration with pydiverse colspec (same as dataframely but with pydiverse transform based SQL support)
 
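The result-hashing behaviour that the added paragraph describes for `lazy=True` dataframe tasks can be illustrated with a toy model. The sketch below is not pipedag's implementation and does not use its API: `ToyCache`, `run_lazy_task`, `hash_rows`, and the list-of-dicts table representation are hypothetical names chosen for illustration, and SHA-256 over a JSON serialization is an assumed hashing scheme. It only shows the idea that the task always runs, while downstream work is invalidated only when the hash of its output changes.

```python
import hashlib
import json
from typing import Callable

# Stand-in for a small dataframe (hypothetical representation, not a pipedag type).
Rows = list[dict]


def hash_rows(rows: Rows) -> str:
    """Deterministic content hash of a small table (assumed scheme: SHA-256 of JSON)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ToyCache:
    """Remembers the last output hash per task name (illustrative only)."""

    def __init__(self) -> None:
        self.output_hashes: dict[str, str] = {}

    def run_lazy_task(self, name: str, task: Callable[[], Rows]) -> tuple[Rows, bool]:
        """Always execute `task`; report whether downstream tasks must rerun."""
        rows = task()  # analogue of lazy=True: the task body is never skipped
        new_hash = hash_rows(rows)
        downstream_invalid = self.output_hashes.get(name) != new_hash
        self.output_hashes[name] = new_hash
        return rows, downstream_invalid


def small_lookup() -> Rows:
    return [{"code": "A", "label": "alpha"}, {"code": "B", "label": "beta"}]


def small_lookup_changed() -> Rows:
    return [{"code": "A", "label": "alpha"}, {"code": "B", "label": "BETA"}]


cache = ToyCache()
_, first_run = cache.run_lazy_task("lookup", small_lookup)           # no previous hash
_, second_run = cache.run_lazy_task("lookup", small_lookup)          # identical output
_, third_run = cache.run_lazy_task("lookup", small_lookup_changed)   # output differs
print(first_run, second_run, third_run)  # prints: True False True
```

Because the whole output is hashed on every run, the cost grows with the size of the result, which is consistent with the paragraph's recommendation to use this mode for small dataframes that are quick to compute.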