
Release 0.14.0

@zoyahav released this 05 Aug 16:46

Major Features and Improvements

  • New tft.word_count mapper to identify the number of tokens in each row
    (for pre-tokenized strings); see the sketches after this list.
  • All tft.scale_to_* mappers now have per-key variants, along with analyzers
    for mean_and_var_per_key and min_and_max_per_key; see the sketches after
    this list.
  • New tft_beam.AnalyzeDatasetWithCache allows analyzing ranges of data while
    producing and utilizing cache. tft.analyzer_cache can help read and write
    such cache to a filesystem between runs. This caching feature is worth
    using when analyzing a rolling range of data in a continuous pipeline.
    This is an experimental feature.
  • Added reduce_instance_dims support to tft.quantiles and elementwise
    support to tft.bucketize, while avoiding separate beam calls for each
    feature; see the sketches after this list.
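
A minimal sketch of the new tft.word_count mapper inside a preprocessing_fn;
the 'sentence' feature name is illustrative, and the split step simply
produces the pre-tokenized SparseTensor the mapper expects:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # 'sentence' is a hypothetical raw string feature; splitting it yields
  # the SparseTensor of tokens whose per-row count word_count computes.
  tokens = tf.compat.v1.string_split(inputs['sentence'])
  return {'num_tokens': tft.word_count(tokens)}
```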
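
A sketch of one per-key variant, tft.scale_to_z_score_per_key, assuming
hypothetical 'price' and 'store' features:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Standardize 'price' with a mean and variance computed separately for
  # each value of the 'store' key, rather than over the whole dataset.
  return {
      'price_z': tft.scale_to_z_score_per_key(
          inputs['price'], key=inputs['store']),
  }
```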
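
And a sketch of elementwise bucketization, assuming a hypothetical
multi-dimensional 'embedding' feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # With elementwise=True, each dimension of 'embedding' gets its own
  # quantile boundaries instead of pooling all values together.
  return {
      'embedding_buckets': tft.bucketize(
          inputs['embedding'], num_buckets=10, elementwise=True),
  }
```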

Bug Fixes and Other Changes

  • sparse_tensor_to_dense_with_shape now accepts an optional default_value
    parameter (see the sketches after this list).
  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    fingerprint_shuffle to sort the vocabularies by fingerprint instead of by
    count, which is useful for load balancing the training parameter servers.
    This is an experimental feature (see the sketches after this list).
  • Fix numerical instability in tft.vocabulary mutual information calculations.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now support computing
    vocabularies over integer categoricals and multivalent input features, and
    computing mutual information for non-binary labels.
  • New numeric normalization method available:
    tft.apply_buckets_with_interpolation (see the sketches after this list).
  • Changes to make this library more compatible with TensorFlow 2.0.
  • Fix sanitizing of vocabulary filenames.
  • Emit a friendly error message when context isn't set.
  • Analyzer output dtypes are enforced to be TensorFlow dtypes, and by extension
    ptransform_analyzer's output_dtypes is enforced to be a list of TensorFlow
    dtypes.
  • Make tft.apply_buckets_with_interpolation support SparseTensors.
  • Added an experimental API for analyzers to annotate the post-transform
    schema.
  • TFTransformOutput.transform_raw_features now accepts an optional
    drop_unused_features parameter to exclude unused features from the output
    (see the sketches after this list).
  • If not specified, the min_diff_from_avg parameter of tft.vocabulary now
    defaults to a reasonable value based on the size of the dataset (relevant
    only if computing vocabularies using mutual information).
  • Convert some tf.contrib functions to be compatible with TF2.0.
  • New tft.bag_of_words mapper to compute the unique set of ngrams for each
    row (for pre-tokenized strings); see the sketches after this list.
  • Fixed a bug in tf_utils.reduce_batch_count_mean_and_var (and, as a result,
    in the mean_and_var analyzer) that miscalculated variance for the sparse
    elementwise=True case.
  • Added test utility tft_unit.cross_named_parameters for creating
    parameterized tests that involve the Cartesian product of various
    parameters (see the sketches after this list).
  • Depends on tensorflow-metadata>=0.14,<0.15.
  • Depends on apache-beam[gcp]>=2.14,<3.
  • Depends on numpy>=1.16,<2.
  • Depends on absl-py>=0.7,<2.
  • Allow preprocessing_fn to emit a tf.RaggedTensor. In this case, the output
    Schema proto cannot be converted to a feature spec, so the output data
    cannot be materialized with tft.coders.
  • Ability to directly set an exact num_buckets with the new parameter
    always_return_num_quantiles for analyzers.quantiles and mappers.bucketize;
    it defaults to False in general, but to True when reduce_instance_dims is
    False (see the sketches after this list).
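
A sketch of the new default_value parameter of
sparse_tensor_to_dense_with_shape, assuming a hypothetical variable-length
string feature 'tags':

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Densify 'tags' to a fixed width of 5 columns, filling the missing
  # positions with the new default_value instead of the built-in default.
  dense_tags = tft.sparse_tensor_to_dense_with_shape(
      inputs['tags'], shape=[None, 5], default_value='')
  return {'tags_dense': dense_tags}
```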
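
A sketch of the experimental fingerprint_shuffle option, assuming a
hypothetical 'terms' feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Order the vocabulary by fingerprint rather than by frequency so that
  # frequent entries are spread across parameter servers (experimental).
  tft.vocabulary(inputs['terms'], fingerprint_shuffle=True,
                 vocab_filename='terms_fingerprint_vocab')
  return {'terms': inputs['terms']}
```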
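
A sketch of tft.apply_buckets_with_interpolation, assuming a hypothetical
numeric 'age' feature and decile boundaries from tft.quantiles:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Interpolate 'age' between its decile boundaries, producing a smooth
  # value in [0, 1] instead of a hard bucket index.
  boundaries = tft.quantiles(inputs['age'], num_buckets=10, epsilon=0.01)
  return {
      'age_normalized': tft.apply_buckets_with_interpolation(
          inputs['age'], boundaries),
  }
```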
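
A sketch of the new tft.bag_of_words mapper; the 'sentence' feature and the
(1, 2) ngram range are illustrative:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Reduce each row's tokens to its unique set of unigrams and bigrams.
  tokens = tf.compat.v1.string_split(inputs['sentence'])
  return {'bow': tft.bag_of_words(tokens, ngram_range=(1, 2), separator=' ')}
```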
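
A sketch of drop_unused_features; the transform-output path and the raw
feature batch are placeholders:

```python
import tensorflow as tf
import tensorflow_transform as tft

# Placeholder batch of raw features; real code would feed tensors that
# match the transform's input schema.
raw_features = {'age': tf.constant([[35.0]])}

# Load a previous tf.Transform run's output (path is a placeholder) and
# drop any raw features the transform graph never actually uses.
tft_output = tft.TFTransformOutput('/path/to/transform_output')
transformed_features = tft_output.transform_raw_features(
    raw_features, drop_unused_features=True)
```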
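
A sketch of tft_unit.cross_named_parameters; the import path and the exact
format of the combined test-case names are assumptions:

```python
import tensorflow as tf
# Assumed import path for TFT's test utilities; it may differ by version.
from tensorflow_transform import test_case as tft_unit

# Cross two named-parameter lists into their Cartesian product of four
# cases; each combined case merges both dicts and both testcase_names.
PARAMS = tft_unit.cross_named_parameters(
    [dict(testcase_name='int64', dtype=tf.int64),
     dict(testcase_name='float32', dtype=tf.float32)],
    [dict(testcase_name='dense', is_sparse=False),
     dict(testcase_name='sparse', is_sparse=True)])
```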
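
A sketch of always_return_num_quantiles on tft.bucketize, assuming a
hypothetical numeric 'score' feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Guarantee exactly 10 buckets even when skewed data would otherwise
  # let the quantile sketch return fewer distinct boundaries.
  return {
      'score_bucket': tft.bucketize(
          inputs['score'], num_buckets=10,
          always_return_num_quantiles=True),
  }
```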

Breaking changes

  • tf_utils.reduce_batch_count_mean_and_var, which feeds into
    tft.mean_and_var, now returns 0 instead of inf for empty columns of a
    sparse tensor.
  • tensorflow_transform.tf_metadata.dataset_schema.Schema class is removed.
    Wherever a dataset_schema.Schema was used, users should now provide a
    tensorflow_metadata.proto.v0.schema_pb2.Schema proto. For backwards
    compatibility, dataset_schema.Schema is now a factory method that produces
    a Schema proto. Updating code should be straightforward because the
    dataset_schema.Schema class was already a wrapper around the Schema proto.
  • Only explicitly public analyzers are exported to the tft module, e.g.
    combiners are no longer exported and have to be accessed directly through
    tft.analyzers.
  • Requires pre-installed TensorFlow >=1.14,<2.

Deprecations

  • DatasetSchema is now a deprecated factory method (see above).
  • tft.tf_metadata.dataset_schema.from_feature_spec is now deprecated.
    Equivalent functionality is provided by
    tft.tf_metadata.schema_utils.schema_from_feature_spec; see the sketch
    below.
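
A sketch of the replacement call, with an illustrative feature spec:

```python
import tensorflow as tf
from tensorflow_transform.tf_metadata import schema_utils

# Build a tensorflow_metadata Schema proto directly from a feature spec,
# replacing the deprecated dataset_schema.from_feature_spec.
schema = schema_utils.schema_from_feature_spec({
    'age': tf.io.FixedLenFeature([], tf.float32),
    'terms': tf.io.VarLenFeature(tf.string),
})
```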