
Release 0.14.0

@zoyahav released this 05 Aug 16:46

Major Features and Improvements

  • New tft.word_count mapper to identify the number of tokens in each row
    (for pre-tokenized strings); see the sketches after this list.
  • All tft.scale_to_* mappers now have per-key variants, along with analyzers
    for mean_and_var_per_key and min_and_max_per_key; see the sketches after
    this list.
  • New tft_beam.AnalyzeDatasetWithCache allows analyzing ranges of data while
    producing and utilizing cache. tft.analyzer_cache can help read and write
    such cache to a filesystem between runs. This caching feature is worth
    using when analyzing a rolling range of data in a continuous pipeline.
    This is an experimental feature.
  • Added reduce_instance_dims support to tft.quantiles and elementwise
    support to tft.bucketize, while avoiding separate beam calls for each
    feature; see the sketches after this list.
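
A minimal sketch of the new tft.word_count mapper inside a preprocessing_fn;
the 'sentence' feature name is illustrative, and the split step simply
produces the pre-tokenized SparseTensor the mapper expects:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # 'sentence' is a hypothetical raw string feature; splitting it yields
  # the SparseTensor of tokens whose per-row count word_count computes.
  tokens = tf.compat.v1.string_split(inputs['sentence'])
  return {'num_tokens': tft.word_count(tokens)}
```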
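
A sketch of one per-key variant, tft.scale_to_z_score_per_key, assuming
hypothetical 'price' and 'store' features:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Standardize 'price' with a mean and variance computed separately for
  # each value of the 'store' key, rather than over the whole dataset.
  return {
      'price_z': tft.scale_to_z_score_per_key(
          inputs['price'], key=inputs['store']),
  }
```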
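
And a sketch of elementwise bucketization, assuming a hypothetical
multi-dimensional 'embedding' feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # With elementwise=True, each dimension of 'embedding' gets its own
  # quantile boundaries instead of pooling all values together.
  return {
      'embedding_buckets': tft.bucketize(
          inputs['embedding'], num_buckets=10, elementwise=True),
  }
```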

Bug Fixes and Other Changes

  • sparse_tensor_to_dense_with_shape now accepts an optional default_value
    parameter (see the sketches after this list).
  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    fingerprint_shuffle to sort the vocabularies by fingerprint instead of by
    count, which is useful for load balancing the training parameter servers.
    This is an experimental feature (see the sketches after this list).
  • Fix numerical instability in tft.vocabulary mutual information calculations.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now support computing
    vocabularies over integer categoricals and multivalent input features, and
    computing mutual information for non-binary labels.
  • New numeric normalization method available:
    tft.apply_buckets_with_interpolation (see the sketches after this list).
  • Changes to make this library more compatible with TensorFlow 2.0.
  • Fix sanitizing of vocabulary filenames.
  • Emit a friendly error message when context isn't set.
  • Analyzer output dtypes are enforced to be TensorFlow dtypes, and by extension
    ptransform_analyzer's output_dtypes is enforced to be a list of TensorFlow
    dtypes.
  • Make tft.apply_buckets_with_interpolation support SparseTensors.
  • Added an experimental API for analyzers to annotate the post-transform
    schema.
  • TFTransformOutput.transform_raw_features now accepts an optional
    drop_unused_features parameter to exclude unused features from the output
    (see the sketches after this list).
  • If not specified, the min_diff_from_avg parameter of tft.vocabulary now
    defaults to a reasonable value based on the size of the dataset (relevant
    only if computing vocabularies using mutual information).
  • Convert some tf.contrib functions to be compatible with TF2.0.
  • New tft.bag_of_words mapper to compute the unique set of ngrams for each
    row (for pre-tokenized strings); see the sketches after this list.
  • Fixed a bug in tf_utils.reduce_batch_count_mean_and_var (and, as a result,
    in the mean_and_var analyzer) that miscalculated variance for the sparse
    elementwise=True case.
  • Added test utility tft_unit.cross_named_parameters for creating
    parameterized tests that involve the Cartesian product of various
    parameters (see the sketches after this list).
  • Depends on tensorflow-metadata>=0.14,<0.15.
  • Depends on apache-beam[gcp]>=2.14,<3.
  • Depends on numpy>=1.16,<2.
  • Depends on absl-py>=0.7,<2.
  • Allow preprocessing_fn to emit a tf.RaggedTensor. In this case, the output
    Schema proto cannot be converted to a feature spec, so the output data
    cannot be materialized with tft.coders.
  • Ability to directly set an exact num_buckets with the new parameter
    always_return_num_quantiles for analyzers.quantiles and mappers.bucketize;
    it defaults to False in general, but to True when reduce_instance_dims is
    False (see the sketches after this list).
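
A sketch of the new default_value parameter of
sparse_tensor_to_dense_with_shape, assuming a hypothetical variable-length
string feature 'tags':

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Densify 'tags' to a fixed width of 5 columns, filling the missing
  # positions with the new default_value instead of the built-in default.
  dense_tags = tft.sparse_tensor_to_dense_with_shape(
      inputs['tags'], shape=[None, 5], default_value='')
  return {'tags_dense': dense_tags}
```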
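
A sketch of the experimental fingerprint_shuffle option, assuming a
hypothetical 'terms' feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Order the vocabulary by fingerprint rather than by frequency so that
  # frequent entries are spread across parameter servers (experimental).
  tft.vocabulary(inputs['terms'], fingerprint_shuffle=True,
                 vocab_filename='terms_fingerprint_vocab')
  return {'terms': inputs['terms']}
```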
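
A sketch of tft.apply_buckets_with_interpolation, assuming a hypothetical
numeric 'age' feature and decile boundaries from tft.quantiles:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Interpolate 'age' between its decile boundaries, producing a smooth
  # value in [0, 1] instead of a hard bucket index.
  boundaries = tft.quantiles(inputs['age'], num_buckets=10, epsilon=0.01)
  return {
      'age_normalized': tft.apply_buckets_with_interpolation(
          inputs['age'], boundaries),
  }
```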
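
A sketch of the new tft.bag_of_words mapper; the 'sentence' feature and the
(1, 2) ngram range are illustrative:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Reduce each row's tokens to its unique set of unigrams and bigrams.
  tokens = tf.compat.v1.string_split(inputs['sentence'])
  return {'bow': tft.bag_of_words(tokens, ngram_range=(1, 2), separator=' ')}
```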
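
A sketch of drop_unused_features; the transform-output path and the raw
feature batch are placeholders:

```python
import tensorflow as tf
import tensorflow_transform as tft

# Placeholder batch of raw features; real code would feed tensors that
# match the transform's input schema.
raw_features = {'age': tf.constant([[35.0]])}

# Load a previous tf.Transform run's output (path is a placeholder) and
# drop any raw features the transform graph never actually uses.
tft_output = tft.TFTransformOutput('/path/to/transform_output')
transformed_features = tft_output.transform_raw_features(
    raw_features, drop_unused_features=True)
```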
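
A sketch of tft_unit.cross_named_parameters; the import path and the exact
format of the combined test-case names are assumptions:

```python
import tensorflow as tf
# Assumed import path for TFT's test utilities; it may differ by version.
from tensorflow_transform import test_case as tft_unit

# Cross two named-parameter lists into their Cartesian product of four
# cases; each combined case merges both dicts and both testcase_names.
PARAMS = tft_unit.cross_named_parameters(
    [dict(testcase_name='int64', dtype=tf.int64),
     dict(testcase_name='float32', dtype=tf.float32)],
    [dict(testcase_name='dense', is_sparse=False),
     dict(testcase_name='sparse', is_sparse=True)])
```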
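
A sketch of always_return_num_quantiles on tft.bucketize, assuming a
hypothetical numeric 'score' feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Guarantee exactly 10 buckets even when skewed data would otherwise
  # let the quantile sketch return fewer distinct boundaries.
  return {
      'score_bucket': tft.bucketize(
          inputs['score'], num_buckets=10,
          always_return_num_quantiles=True),
  }
```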

Breaking changes

  • tf_utils.reduce_batch_count_mean_and_var, which feeds into
    tft.mean_and_var, now returns 0 instead of inf for empty columns of a
    sparse tensor.
  • tensorflow_transform.tf_metadata.dataset_schema.Schema class is removed.
    Wherever a dataset_schema.Schema was used, users should now provide a
    tensorflow_metadata.proto.v0.schema_pb2.Schema proto. For backwards
    compatibility, dataset_schema.Schema is now a factory method that produces
    a Schema proto. Updating code should be straightforward because the
    dataset_schema.Schema class was already a wrapper around the Schema proto.
  • Only explicitly public analyzers are exported to the tft module, e.g.
    combiners are no longer exported and have to be accessed directly through
    tft.analyzers.
  • Requires pre-installed TensorFlow >=1.14,<2.

Deprecations

  • DatasetSchema is now a deprecated factory method (see above).
  • tft.tf_metadata.dataset_schema.from_feature_spec is now deprecated.
    Equivalent functionality is provided by
    tft.tf_metadata.schema_utils.schema_from_feature_spec; see the sketch
    below.
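
A sketch of the replacement call, with an illustrative feature spec:

```python
import tensorflow as tf
from tensorflow_transform.tf_metadata import schema_utils

# Build a tensorflow_metadata Schema proto directly from a feature spec,
# replacing the deprecated dataset_schema.from_feature_spec.
schema = schema_utils.schema_from_feature_spec({
    'age': tf.io.FixedLenFeature([], tf.float32),
    'terms': tf.io.VarLenFeature(tf.string),
})
```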