Releases: tensorflow/transform
Version 0.23.0
Major Features and Improvements
- Added `tft.scale_to_gaussian` to transform input to standard gaussian.
- Vocabulary related analyzers and mappers now accept a `file_format`
  argument allowing the vocabulary to be saved in TFRecord format. The
  default format remains text (TFRecord format requires tensorflow>=2.4).
  A sketch of both features follows this list.
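A minimal sketch of how these two additions might appear in a `preprocessing_fn`; the feature names and the `'tfrecord_gzip'` format value are illustrative assumptions, not taken from this release note:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Map a numeric feature to an approximately standard gaussian.
  x_gauss = tft.scale_to_gaussian(inputs['x'])
  # Store the computed vocabulary in TFRecord format rather than the
  # default text format ('tfrecord_gzip' is an assumed format value).
  s_ids = tft.compute_and_apply_vocabulary(
      inputs['s'], file_format='tfrecord_gzip')
  return {'x_gauss': x_gauss, 's_ids': s_ids}
```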
Bug Fixes and Other Changes
- Enable `SavedModelLoader` to import and apply TF2 SavedModels.
- `tft.min`, `tft.max`, `tft.sum`, `tft.covariance` and `tft.pca` now have
  default output values to properly process empty analysis datasets.
- `tft.scale_by_min_max`, `tft.scale_to_0_1` and the corresponding per-key
  versions now apply a sigmoid function to scale tensors if the analysis
  dataset is either empty or contains a single distinct value.
- Added best-effort tf.text op registration when loading transformation
  graphs.
- Vocabularies computed over numerical features will now assign values to
  entries with equal frequency in reverse lexicographical order as well,
  similarly to string features.
- Fixed an issue that causes the `TABLE_INITIALIZERS` graph collection to
  contain a tensor instead of an op when a TF2 SavedModel or a TF2 Hub
  Module containing a table is loaded inside the `preprocessing_fn`.
- Fixes an issue where the output tensors of `tft.TransformFeaturesLayer`
  would all have unknown shapes.
- Stopped depending on `avro-python3`.
- Depends on `apache-beam[gcp]>=2.23,<3`.
- Depends on `tensorflow>=1.15.2,!=2.0.*,!=2.1.*,!=2.2.*,<2.4`.
- Depends on `tensorflow-metadata>=0.23,<0.24`.
- Depends on `tfx-bsl>=0.23,<0.24`.
Breaking changes
- Existing caches (for all analyzers) are automatically invalidated.
Deprecations
- Deprecating Py2 support.
- Note: We plan to remove Python 3.5 support after this release.
Version 0.22.0
Major Features and Improvements
Bug Fixes and Other Changes
- `tft.bucketize_per_key` no longer assumes that the keys during
  transformation existed in the analysis dataset. If a key is missing then
  the assigned bucket will be -1 (see the sketch after this list).
- `tft.estimated_probability_density`, when `categorical=True`, no longer
  assumes that the values during transformation existed in the analysis
  dataset, and will assume 0 density in that case.
- Switched analyzer cache representation of dataset keys from using a
  primitive str to a `DatasetKey` class.
- `tft_beam.analyzer_cache.ReadAnalysisCacheFromFS` can now filter cache
  entry keys when given a `cache_entry_keys` parameter. `cache_entry_keys`
  can be produced by utilizing `get_analysis_cache_entry_keys`.
- Reduced number of shuffles via packing multiple combine merges into a
  single Beam combiner.
- Switch `tft.TransformFeaturesLayer` to use the TF 2 `tf.saved_model.load`
  API to load a previously exported SavedModel.
- Adds `tft.sparse_tensor_left_align` as a utility which aligns
  `tf.SparseTensor`s to the left.
- Depends on `avro-python3>=1.8.1,!=1.9.2.*,<2.0.0` for Python 3.5 + MacOS.
- Depends on `apache-beam[gcp]>=2.20.0,<3`.
- Depends on `tensorflow>=1.15,!=2.0.*,<2.3`.
- Depends on `tensorflow-metadata>=0.22.0,<0.23.0`.
- Depends on `tfx-bsl>=0.22.0,<0.23.0`.
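A minimal sketch of the two APIs touched above, assuming hypothetical feature names ('value', 'key', 'sparse_feature'):

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Keys unseen during analysis now map to bucket -1 instead of failing.
  buckets = tft.bucketize_per_key(
      inputs['value'], inputs['key'], num_buckets=10)
  # Align the values of a tf.SparseTensor to the left along each row.
  aligned = tft.sparse_tensor_left_align(inputs['sparse_feature'])
  return {'buckets': buckets, 'aligned': aligned}
```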
Breaking changes
- `tft.AnalyzeDatasetWithCache` no longer accepts a flat pcollection as an
  input. Instead it will flatten the datasets in the
  `input_values_pcoll_dict` input if needed.
- `tft.TransformFeaturesLayer` no longer takes a parameter
  `drop_unused_features`. Its default behavior is now equivalent to having
  set `drop_unused_features` to `True`.
Deprecations
Release 0.21.2
Major Features and Improvements
- Expanded capability for per-key analyzers to analyze larger sets of keys
  that would not fit in memory, by storing the key-value pairs in
  vocabulary files. This is enabled by passing a `per_key_filename` to
  `tft.count_per_key` and `tft.scale_to_z_score_per_key`.
- Added `tft.TransformFeaturesLayer` and
  `tft.TFTransformOutput.transform_features_layer` to allow transforming
  features for a TensorFlow Keras model (see the sketch after this list).
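A sketch of wiring the transform layer into a Keras serving model; the output path and model wiring here are illustrative assumptions:

```python
import tensorflow as tf
import tensorflow_transform as tft

# '/tmp/transform_output' is a placeholder for a real tf.Transform output
# directory produced by a previous Analyze step.
tft_output = tft.TFTransformOutput('/tmp/transform_output')

class ServingModel(tf.keras.Model):
  """Applies the tf.Transform graph to raw features, then a trained model."""

  def __init__(self, trained_model):
    super().__init__()
    self.transform = tft_output.transform_features_layer()
    self.trained_model = trained_model

  def call(self, raw_features):
    transformed = self.transform(raw_features)
    return self.trained_model(transformed)
```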
Bug Fixes and Other Changes
- `tft.apply_buckets_with_interpolation` now handles NaN values by imputing
  with the middle of the normalized range.
- Depends on `tfx-bsl>=0.21.3,<0.22`.
Breaking changes
Deprecations
Release 0.21.0
Major Features and Improvements
- Added a new version of the census example to demonstrate usage in TF 2.0.
- New mapper `estimated_probability_density` to compute either exact
  probabilities (for discrete categorical variables) or approximate density
  over fixed intervals (continuous variables).
- New analyzers `count_per_key` and `histogram` to return counts of unique
  elements or values within predefined ranges. Calling `tft.histogram` on a
  non-categorical value will assign each data point to the appropriate
  fixed bucket and then count for each bucket (see the sketch after this
  list).
- Provided capability for per-key analyzers to analyze larger sets of keys
  that would not fit in memory, by storing the key-value pairs in
  vocabulary files. This is enabled by passing a `per_key_filename` to
  `tft.scale_by_min_max_per_key` and `tft.scale_to_0_1_per_key`.
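A sketch of the new analyzers; the (keys, counts) and (counts, boundaries) return shapes shown here are assumptions based on the descriptions above, not confirmed signatures:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Exact counts for each unique value of a categorical feature
  # (assumed to return a key vocabulary tensor and a counts tensor).
  keys, counts = tft.count_per_key(inputs['category'])
  # For a non-categorical feature, each data point is assigned to a fixed
  # bucket and per-bucket counts are returned (assumed return order).
  hist, boundaries = tft.histogram(inputs['x'])
  # Analyzer outputs are constants that could be used downstream; here we
  # simply pass the raw feature through.
  return {'x': inputs['x']}
```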
Bug Fixes and Other Changes
- Added beam counters to log analyzer and mapper usage.
- Cleaned up deprecated APIs used in census and sentiment examples.
- Support windows style paths in `analyzer_cache`.
- `tft_beam.WriteTransformFn` and `tft_beam.WriteMetadata` have been made
  idempotent to allow retrying them in case of a failure.
- `tft_beam.WriteMetadata` takes an optional argument
  `write_to_unique_subdir` and returns the path to which metadata was
  written. If `write_to_unique_subdir` is True, metadata is written to a
  unique subdirectory under `path`, otherwise it is written to `path`.
- Support non utf-8 characters when reading vocabularies in
  `tft.TFTransformOutput`.
- `tft.TFTransformOutput.vocabulary_by_name` now returns bytes instead of
  str with Python 3.
Breaking changes
Deprecations
Release 0.15.0
Major Features and Improvements
- This release introduces initial beta support for TF 2.0. TF 2.0 programs
  running in "safety" mode (i.e. using TF 1.X APIs through the
  `tensorflow.compat.v1` compatibility module) are expected to work. Newly
  written TF 2.0 programs may not work if they exercise functionality that
  is not yet supported. If you do encounter an issue when using
  `tensorflow-transform` with TF 2.0, please create an issue at
  https://github.com/tensorflow/transform/issues with instructions on how
  to reproduce it.
- Performance improvements for `preprocessing_fn`s with many Quantiles
  analyzers.
- `tft.quantiles` and `tft.bucketize` are now using new TF core quantiles
  ops instead of contrib ops.
- Performance improvements due to packing multiple combine analyzers into a
  single Beam Combiner.
Bug Fixes and Other Changes
- Existing analyzer cache is invalidated.
- Saved transforms now support composite tensors (such as
  `tf.RaggedTensor`).
- Vocabulary's cache coder now supports non utf-8 encodable tokens.
- Fixes encoding of the `tft.covariance` accumulator cache.
- Fixes encoding per-key analyzers accumulator cache.
- Make various utility methods in `tft.inspect_preprocessing_fn` support
  `tf.RaggedTensor`.
- Moved beam/shared lib to `tfx-bsl`. If running with latest master,
  `tfx-bsl` must also be latest master.
- `preprocessing_fn`s now have beta support of calls to `tf.function`s, as
  long as they don't contain calls to `tf.Transform` analyzers/mappers or
  table initializers (see the sketch after this list).
- `tft.quantiles` and `tft.bucketize` are now using core TF ops.
- Depends on `tfx-bsl>=0.15,<0.16`.
- Depends on `tensorflow-metadata>=0.15,<0.16`.
- Depends on `apache-beam[gcp]>=2.16,<3`.
- Depends on `tensorflow>=1.15,<2.2`.
  - Starting from 1.15, package `tensorflow` comes with GPU support. Users
    won't need to choose between `tensorflow` and `tensorflow-gpu`.
  - Caveat: `tensorflow` 2.0.0 is an exception and does not have GPU
    support. If `tensorflow-gpu` 2.0.0 is installed before installing
    `tensorflow-transform`, it will be replaced with `tensorflow` 2.0.0.
    Re-install `tensorflow-gpu` 2.0.0 if needed.
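A sketch of the beta `tf.function` support, assuming a numeric feature 'x'; note the wrapped function contains only plain TF ops:

```python
import tensorflow as tf
import tensorflow_transform as tft

@tf.function
def _clip_to_unit(x):
  # Plain TF ops only: no tf.Transform analyzers/mappers and no table
  # initializers inside the tf.function.
  return tf.clip_by_value(x, 0.0, 1.0)

def preprocessing_fn(inputs):
  return {
      'x_clipped': _clip_to_unit(inputs['x']),
      # Analyzers stay outside of any tf.function.
      'x_scaled': tft.scale_to_z_score(inputs['x']),
  }
```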
Breaking changes
- `always_return_num_quantiles` changed to default to True in
  `tft.quantiles` and `tft.bucketize`, resulting in exact bucket count
  returned.
- Removes the `input_fn_maker` module which has been deprecated since TFT
  0.11. For idiomatic construction of `input_fn`, see
  `tensorflow_transform` examples.
Deprecations
Release 0.14.0
Major Features and Improvements
- New `tft.word_count` mapper to identify the number of tokens for each row
  (for pre-tokenized strings); see the sketch after this list.
- All `tft.scale_to_*` mappers now have per-key variants, along with
  analyzers for `mean_and_var_per_key` and `min_and_max_per_key`.
- New `tft_beam.AnalyzeDatasetWithCache` allows analyzing ranges of data
  while producing and utilizing cache. `tft.analyzer_cache` can help read
  and write such cache to a filesystem between runs. This caching feature
  is worth using when analyzing a rolling range in a continuous pipeline
  manner. This is an experimental feature.
- Added `reduce_instance_dims` support to `tft.quantiles` and `elementwise`
  to `tft.bucketize`, while avoiding separate beam calls for each feature.
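A sketch of the new mapper and a per-key scaling variant, assuming a pre-tokenized SparseTensor feature 'tokens' and hypothetical features 'x' and 'key':

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  return {
      # Number of tokens in each row of a pre-tokenized string feature.
      'n_tokens': tft.word_count(inputs['tokens']),
      # Per-key variant: mean/variance are computed separately per key.
      'x_z': tft.scale_to_z_score_per_key(inputs['x'], key=inputs['key']),
  }
```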
Bug Fixes and Other Changes
- `sparse_tensor_to_dense_with_shape` now accepts an optional
  `default_value` parameter.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support
  `fingerprint_shuffle` to sort the vocabularies by fingerprint instead of
  counts. This is useful for load balancing the training parameter servers.
  This is an experimental feature.
- Fix numerical instability in `tft.vocabulary` mutual information
  calculations.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support
  computing vocabularies over integer categoricals and multivalent input
  features, and computing mutual information for non-binary labels.
- New numeric normalization method available:
  `tft.apply_buckets_with_interpolation` (see the sketch after this list).
- Changes to make this library more compatible with TensorFlow 2.0.
- Fix sanitizing of vocabulary filenames.
- Emit a friendly error message when context isn't set.
- Analyzer output dtypes are enforced to be TensorFlow dtypes, and by
  extension `ptransform_analyzer`'s `output_dtypes` is enforced to be a
  list of TensorFlow dtypes.
- Make `tft.apply_buckets_with_interpolation` support SparseTensors.
- Adds an experimental api for analyzers to annotate the post-transform
  schema.
- `TFTransformOutput.transform_raw_features` now accepts an optional
  `drop_unused_features` parameter to exclude unused features in output.
- If not specified, the `min_diff_from_avg` parameter of `tft.vocabulary`
  now defaults to a reasonable value based on the size of the dataset
  (relevant only if computing vocabularies using mutual information).
- Convert some `tf.contrib` functions to be compatible with TF 2.0.
- New `tft.bag_of_words` mapper to compute the unique set of ngrams for
  each row (for pre-tokenized strings).
- Fixed a bug in `tf_utils.reduce_batch_count_mean_and_var`, and as a
  result the `mean_and_var` analyzer, which was miscalculating variance for
  the sparse `elementwise=True` case.
- Added test utility `tft_unit.cross_named_parameters` for creating
  parameterized tests that involve the cartesian product of various
  parameters.
- Depends on `tensorflow-metadata>=0.14,<0.15`.
- Depends on `apache-beam[gcp]>=2.14,<3`.
- Depends on `numpy>=1.16,<2`.
- Depends on `absl-py>=0.7,<2`.
- Allow `preprocessing_fn` to emit a `tf.RaggedTensor`. In this case, the
  output `Schema` proto will not be able to be converted to a feature spec,
  and so the output data will not be able to be materialized with
  `tft.coders`.
- Ability to directly set exact `num_buckets` with new parameter
  `always_return_num_quantiles` for `analyzers.quantiles` and
  `mappers.bucketize`, defaulting to False in general but True when
  `reduce_instance_dims` is False.
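A sketch combining two of the items above, interpolated bucket normalization and fingerprint-shuffled vocabularies; the feature names and the `epsilon` value are illustrative assumptions:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  x = inputs['x']
  # Compute quantile boundaries as an analyzer...
  boundaries = tft.quantiles(x, num_buckets=10, epsilon=0.01)
  # ...then interpolate within buckets to get a normalized value instead
  # of a hard bucket index.
  x_norm = tft.apply_buckets_with_interpolation(x, boundaries)
  # Experimental: sort the vocabulary by fingerprint rather than frequency,
  # to help load-balance training parameter servers.
  tft.vocabulary(inputs['s'], fingerprint_shuffle=True,
                 vocab_filename='s_vocab')
  return {'x_norm': x_norm}
```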
Breaking changes
- `tf_utils.reduce_batch_count_mean_and_var`, which feeds into
  `tft.mean_and_var`, now returns 0 instead of inf for empty columns of a
  sparse tensor.
- The `tensorflow_transform.tf_metadata.dataset_schema.Schema` class is
  removed. Wherever a `dataset_schema.Schema` was used, users should now
  provide a `tensorflow_metadata.proto.v0.schema_pb2.Schema` proto. For
  backwards compatibility, `dataset_schema.Schema` is now a factory method
  that produces a `Schema` proto. Updating code should be straightforward
  because the `dataset_schema.Schema` class was already a wrapper around
  the `Schema` proto.
- Only explicitly public analyzers are exported to the `tft` module, e.g.
  combiners are no longer exported and have to be accessed directly through
  `tft.analyzers`.
- Requires pre-installed TensorFlow >=1.14,<2.
Deprecations
- `DatasetSchema` is now a deprecated factory method (see above).
- `tft.tf_metadata.dataset_schema.from_feature_spec` is now deprecated.
  Equivalent functionality is provided by
  `tft.tf_metadata.schema_utils.schema_from_feature_spec`.
Release 0.13.0
Major Features and Improvements
- Now `AnalyzeDataset`, `TransformDataset` and `AnalyzeAndTransformDataset`
  can accept input data that only contains columns needed for that
  operation as opposed to all columns defined in schema. Utility methods to
  infer the list of needed columns are added to
  `tft.inspect_preprocessing_fn` (see the sketch after this list). This
  makes it easier to take advantage of columnar projection when data is
  stored in columnar storage formats.
- Python 3.5 is supported.
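A sketch of inferring the needed columns for columnar projection; the utility names `get_analyze_input_columns` / `get_transform_input_columns` and the feature-spec argument are assumptions about the added methods:

```python
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform import inspect_preprocessing_fn

feature_spec = {
    'x': tf.FixedLenFeature([], tf.float32),
    'unused': tf.FixedLenFeature([], tf.string),
}

def preprocessing_fn(inputs):
  return {'x_scaled': tft.scale_to_0_1(inputs['x'])}

# Only 'x' is needed here, so 'unused' can be skipped when reading from
# columnar storage.
analyze_cols = inspect_preprocessing_fn.get_analyze_input_columns(
    preprocessing_fn, feature_spec)
transform_cols = inspect_preprocessing_fn.get_transform_input_columns(
    preprocessing_fn, feature_spec)
```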
Bug Fixes and Other Changes
- Version is now accessible as `tensorflow_transform.__version__`.
- Depends on `apache-beam[gcp]>=2.11,<3`.
- Depends on `protobuf>=3.7,<4`.
Breaking changes
- Coders now return index and value features rather than a combined feature
  for `SparseFeature`.
- Requires pre-installed TensorFlow >=1.13,<2.
Deprecations
Release 0.12.0
Major Features and Improvements
- Python 3.5 readiness complete (all tests pass). Full Python 3.5
  compatibility is expected to be available with the next version of
  Transform (after Apache Beam 2.11 is released).
- Performance improvements for vocabulary generation when using top_k.
- New optimized, highly experimental API for analyzing a dataset was added,
  `AnalyzeDatasetWithCache`, which allows reading and writing analyzer
  cache.
- Update `DatasetMetadata` to be a wrapper around the
  `tensorflow_metadata.proto.v0.schema_pb2.Schema` proto. TensorFlow
  Metadata will be the schema used to define data parsing across TFX. The
  serialized `DatasetMetadata` is now the `Schema` proto in ascii format,
  but the previous format can still be read.
- Change `ApplySavedModel` implementation to use
  `tf.Session.make_callable` instead of `tf.Session.run` for improved
  performance.
Bug Fixes and Other Changes
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support
  filtering based on adjusted mutual information when
  `use_adjusted_mutual_info` is set to True.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now take a
  regularization term `min_diff_from_avg` that adjusts mutual information
  to zero whenever the difference between the count of the feature with any
  label and its expected count is lower than the threshold.
- Added an option to `tft.vocabulary` and
  `tft.compute_and_apply_vocabulary` to compute a coverage vocabulary,
  using the new `coverage_top_k`, `coverage_frequency_threshold` and
  `key_fn` parameters.
- Added `tft.ptransform_analyzer` for advanced use cases.
- Modified `QuantilesCombiner` to use `tf.Session.make_callable` instead of
  `tf.Session.run` for improved performance.
- `ExampleProtoCoder` now also supports non-serialized Example
  representations.
- `tft.tfidf` now accepts a scalar Tensor as `vocab_size`.
- `assertItemsEqual` in unit tests are replaced by `assertCountEqual`.
- `NumPyCombiner` now outputs TF dtypes in output_tensor_infos instead of
  numpy dtypes.
- Adds function `tft.apply_pyfunc` that provides limited support for
  `tf.py_func`. Note that this is incompatible with serving. See
  documentation for more details (and the sketch after this list).
- `CombinePerKey` now adds a dimension for the key.
- Depends on `numpy>=1.14.5,<2`.
- Depends on `apache-beam[gcp]>=2.10,<3`.
- Depends on `protobuf==3.7.0rc2`.
- `ExampleProtoCoder.encode` now converts a feature whose value is `None`
  to an empty value, where before it did not accept `None` as a valid
  value.
- `AnalyzeDataset`, `AnalyzeAndTransformDataset` and `TransformDataset` can
  now accept dictionaries which contain `None`, and which will be
  interpreted the same as an empty list. They will never produce an output
  containing `None`.
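A sketch of `tft.apply_pyfunc`, assuming an argument order that mirrors `tf.py_func` (the function, output dtype, stateful flag, name, then the inputs); this ordering is an assumption, so check the documentation:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Limited py_func support; note this is incompatible with serving, since
  # the Python function cannot be saved into the serving graph.
  x_doubled = tft.apply_pyfunc(
      lambda x: x * 2, tf.float32, True, 'double', inputs['x'])
  return {'x_doubled': x_doubled}
```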
Breaking changes
- `ColumnSchema` and related classes (`Domain`, `Axis` and
  `ColumnRepresentation` and their subclasses) have been removed. In order
  to create a schema, use `from_feature_spec`. In order to inspect a schema
  use the `as_feature_spec` and `domains` methods of `Schema`. The
  constructors of these classes are replaced by functions that still work
  when creating a `Schema` but this usage is deprecated.
- Requires pre-installed TensorFlow >=1.12,<2.
- `ExampleProtoCoder.decode` now converts a feature with empty value (e.g.
  `features { feature { key: "varlen" value { } } }`) or missing key for a
  feature (e.g. `features { }`) to a `None` in the output dictionary.
  Before it would represent these with an empty list. This better reflects
  the original example proto and is consistent with TensorFlow Data
  Validation.
- Coders now return a `list` instead of an `ndarray` for a
  `VarLenFeature`.
Deprecations
Release 0.11.0
Major Features and Improvements
Bug Fixes and Other Changes
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now support
  filtering based on mutual information when `labels` is provided (see the
  sketch after this list).
- Export all package level exports of `tensorflow_transform` from the
  `tensorflow_transform.beam` subpackage. This allows users to just import
  the `tensorflow_transform.beam` subpackage for all functionality.
- Adding API docs.
- Fix bug where Transform returned a different dtype for a VarLenFeature
  with 0 elements.
- Depends on `apache-beam[gcp]>=2.8,<3`.
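A sketch of mutual-information filtering, assuming a token feature and an integer label feature:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # With labels provided, vocabulary entries are filtered by mutual
  # information with the label rather than by raw frequency alone.
  token_ids = tft.compute_and_apply_vocabulary(
      inputs['tokens'], labels=inputs['label'], top_k=10000)
  return {'token_ids': token_ids, 'label': inputs['label']}
```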
Breaking changes
- Requires pre-installed TensorFlow >=1.11,<2.
Deprecations
- All functions in `tensorflow_transform.saved.input_fn_maker` are
  deprecated. See the examples for how to construct the `input_fn` for
  training and serving. Note that the examples demonstrate the use of the
  `tf.estimator` API. The functions named `*_serving_input_fn` were for use
  with the `tf.contrib.estimator` API which is now deprecated. We do not
  provide examples of usage of the `tf.contrib.estimator` API; instead
  users should upgrade to the `tf.estimator` API.
Release 0.9.0
Major Features and Improvements
- Performance improvements for vocabulary generation when using top_k.
- Utility to deep-copy Beam `PCollection`s was added to avoid unnecessary
  materialization.
- Utilize deep_copy to avoid unnecessary materialization of pcollections
  when the input data is immutable. This feature is currently off by
  default and can be enabled by setting
  `tft.Context.use_deep_copy_optimization=True`.
- Add `bucketize_per_key` which computes separate quantiles for each key
  and then bucketizes each value according to the quantiles computed for
  its key (see the sketch after this list).
- `tft.scale_to_z_score` is now implemented with a single pass over the
  data.
- Export schema_utils package to convert from the `tensorflow-metadata`
  package to the (soon to be deprecated) `tf_metadata` subpackage of
  `tensorflow-transform`.
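A sketch of `bucketize_per_key` and the opt-in deep-copy optimization, using the Beam `Context` idiom of this era; the module alias, feature names, and `temp_dir` are illustrative assumptions:

```python
import tensorflow_transform as tft
import tensorflow_transform.beam.impl as beam_impl

def preprocessing_fn(inputs):
  # Quantiles are computed separately per key; each value is bucketized
  # against the boundaries for its own key.
  return {
      'x_bucket': tft.bucketize_per_key(
          inputs['x'], inputs['key'], num_buckets=4),
  }

# The deep-copy optimization is off by default in this release.
with beam_impl.Context(temp_dir='/tmp/tft',
                       use_deep_copy_optimization=True):
  pass  # run AnalyzeAndTransformDataset etc. here
```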
Bug Fixes and Other Changes
- Memory reduction during vocabulary generation.
- Clarify documentation on return values from
  `tft.compute_and_apply_vocabulary` and `tft.string_to_int`.
- `tft.unit` now explicitly creates Beam PCollections and validates the
  transformed dataset by writing and then reading it from disk.
- `tft.min`, `tft.size`, `tft.sum`, `tft.scale_to_z_score` and
  `tft.bucketize` now support `tf.SparseTensor`.
- Fix to `tft.scale_to_z_score` so it no longer attempts to divide by 0
  when the variance is 0.
- Fix bug where internal graph analysis didn't handle the case where an
  operation has control inputs that are operations (as opposed to tensors).
- `tft.sparse_tensor_to_dense_with_shape` added which allows densifying a
  `SparseTensor` while specifying the resulting `Tensor`'s shape.
- Add `load_transform_graph` method to `TFTransformOutput` to load the
  transform graph without applying it. This has the effect of adding
  variables to the checkpoint when calling it from the training `input_fn`
  when using `tf.Estimator`.
- `tft.vocabulary` and `tft.compute_and_apply_vocabulary` now accept an
  optional `weights` argument. When `weights` is provided, weighted
  frequencies are used instead of frequencies based on counts.
- `tft.quantiles` and `tft.bucketize` now accept an optional `weights`
  argument. When `weights` is provided, weighted count is used for
  quantiles instead of the counts themselves.
- Updated examples to construct the schema using
  `dataset_schema.from_feature_spec`.
- Updated the census example to allow the 'education-num' feature to be
  missing and fill in a default value when it is.
- Depends on `tensorflow-metadata>=0.9,<1`.
- Depends on `apache-beam[gcp]>=2.6,<3`.
Breaking changes
- We now validate a `Schema` in its constructor to make sure that it can be
  converted to a feature spec. In particular only `tf.int64`, `tf.string`
  and `tf.float32` types are allowed.
- We now disallow default values for `FixedColumnRepresentation`.
- It is no longer possible to set a default value in the Schema, and
  validation of shape parameters will occur earlier.
- Removed `Schema.as_batched_placeholders()` method.
- Removed all components of DatasetMetadata except the schema, and removed
  all related classes and code.
- Removed the merge method for DatasetMetadata and related classes.
- read_metadata can now only read from a single metadata directory and
  read_metadata and write_metadata no longer accept the `versions`
  parameter. They now only read/write the JSON format.
- Requires pre-installed TensorFlow >=1.9,<2.
Deprecations
- `apply_function` is no longer needed and is deprecated.
  `apply_function(fn, *args)` is now equivalent to `fn(*args)`.
  `tf.Transform` is able to handle while loops and tables without the user
  wrapping the function call in `apply_function` (see the sketch below).
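A small before/after sketch; `my_lookup_fn` and its table contents are hypothetical:

```python
import tensorflow as tf
import tensorflow_transform as tft

def my_lookup_fn(tokens):
  # A hypothetical user function that creates a table internally.
  table = tf.contrib.lookup.index_table_from_tensor(['a', 'b', 'c'])
  return table.lookup(tokens)

def preprocessing_fn(inputs):
  # Before (deprecated): ids = tft.apply_function(my_lookup_fn, inputs['t'])
  # Now equivalent; call the function directly:
  return {'ids': my_lookup_fn(inputs['t'])}
```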