
ElasticDL TF Transform Explore

brightcoder01 edited this page Nov 13, 2019 · 35 revisions

Exploring TensorFlow Transform in ElasticDL

Motivation

Data preprocessing is an important step before model training in the ML pipeline. Consistency between offline (training) and online (serving) preprocessing is the key requirement. Both TensorFlow Transform and the Feature Column API can ensure this consistency.

Why the Feature Column API Is Not Enough

The Feature Column API can handle part of the transform work, but it cannot cover all feature engineering requirements, especially in the following two aspects:

Analyzer
Let's take scaling one column of dense data to the range [0, 1) as an example.

import tensorflow as tf

def _scale_age_to_0_1(input_tensor):
    # min/max must be hard-coded as constants here
    min_age = 1
    max_age = 100
    return (input_tensor - min_age) / (max_age - min_age)

tf.feature_column.numeric_column('age', normalizer_fn=_scale_age_to_0_1)

We need to define the min and max values of the column age as constants in the code. Min and max are statistics computed by scanning the entire dataset. It is common to refit the model with the latest data every day, and these statistics vary from day to day. It is impractical to update the constants in the code manually for a daily job.

Inter Columns Calculation
According to the feature column API documentation, all feature columns except crossed_column transform only a single input column. So we cannot implement inter-column calculations such as the following with the Feature Column API:

column_new = column_a * column_b

Why TensorFlow Transform

The key preprocessing logic is defined in a user-defined function, preprocessing_fn(inputs). The function is traced into a TF graph. Each analyzer node is first converted into a placeholder tensor. After analyzing the entire dataset and computing the results of all analyzer nodes, TF Transform replaces each placeholder with its analysis result as a constant tensor. The resulting TF graph can then transform the data records one by one. Finally, the transform graph can be exported as a SavedModel and integrated with the inference graph.
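The two-phase mechanism above can be sketched in plain Python without any TF Transform dependency (function names like analyze_min_max are illustrative, not part of the real tft API):

```python
def analyze_min_max(dataset, column):
    """Phase 1: a full pass over the dataset to compute statistics.
    In TF Transform this fills in the placeholder tensor created
    for each analyzer node."""
    values = [row[column] for row in dataset]
    return min(values), max(values)

def transform_record(row, column, min_v, max_v):
    """Phase 2: per-record transform using the now-constant stats."""
    out = dict(row)
    out[column] = (row[column] - min_v) / (max_v - min_v)
    return out

dataset = [{'age': 20}, {'age': 40}, {'age': 60}]
min_v, max_v = analyze_min_max(dataset, 'age')
transformed = [transform_record(r, 'age', min_v, max_v) for r in dataset]
print(transformed)  # [{'age': 0.0}, {'age': 0.5}, {'age': 1.0}]
```

The important property is that the per-record transform in phase 2 is a pure function of the record plus constants, which is exactly what lets TF Transform export it as a serving-time graph.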
Please check the official tutorial.

Analyzer
With TF Transform, we only need a single API: tft.scale_to_0_1. TF Transform first analyzes the whole dataset to compute the min and max, then uses these results to transform the data.

import tensorflow_transform as tft
outputs['age'] = tft.scale_to_0_1(inputs['age'])

Inter Columns Calculation
Users can write arbitrary transform logic inside preprocessing_fn(inputs). Operations between two or more columns are naturally supported as long as they can be traced into the TF graph.
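For instance, a preprocessing_fn combining two columns might look like the following. This is a plain-Python sketch operating on dicts of floats with illustrative column names; the real function receives dicts of tensors and can also call tft analyzers such as tft.scale_to_0_1:

```python
def preprocessing_fn(inputs):
    """Sketch of a user-defined preprocessing_fn with an
    inter-column calculation (column names are illustrative)."""
    outputs = {}
    # single-column passthrough
    outputs['sepal_length'] = inputs['sepal_length']
    # inter-column calculation: not expressible with plain feature columns
    outputs['sepal_area'] = inputs['sepal_length'] * inputs['sepal_width']
    return outputs

print(preprocessing_fn({'sepal_length': 5.0, 'sepal_width': 3.0}))
# {'sepal_length': 5.0, 'sepal_area': 15.0}
```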

Transform Execution

Apache Beam and Runtime Engine

The entire TF Transform process is a data processing pipeline. TF Transform is tightly integrated with Apache Beam and uses it to describe the pipeline as a DAG. A Beam pipeline is runtime-engine neutral and can be translated into execution plans for different data processing engines (such as DataFlow, Flink, and Spark).

  1. We need a data processing engine with mature support for Apache Beam.
  2. We need to be able to run Python scripts on this engine to process the data in parallel.
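Conceptually, a Beam pipeline first records its transforms as a DAG and only executes them when a runner translates the DAG for a specific engine. A dependency-free toy sketch of this deferred-execution pattern (Beam's actual API composes transforms with the `|` operator and `beam.Map`, which this does not attempt to reproduce):

```python
class ToyPipeline:
    """Toy stand-in for a Beam-style deferred pipeline: stages are
    recorded first (building the DAG), then executed by a 'runner'."""
    def __init__(self):
        self.stages = []

    def apply(self, fn):
        self.stages.append(fn)  # record only; nothing runs yet
        return self

    def run(self, data):
        # A real runner (DataFlow, Flink, Spark) would distribute this
        # across workers; here we just execute stages in order.
        for fn in self.stages:
            data = [fn(x) for x in data]
        return data

p = ToyPipeline().apply(lambda r: r * 2).apply(lambda r: r + 1)
print(p.run([1, 2, 3]))  # [3, 5, 7]
```

This separation between pipeline description and execution is why requirement 1 above matters: the engine must understand Beam's DAG, not just run arbitrary Python.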

Walkthrough Issues

Integration with SQLFlow

Let's look at a typical SQL expression for model training below. SELECT * FROM iris.train means retrieving data from the data source; it maps to a SQL query in a database or an ODPS SQL query on an ODPS table. COLUMN sepal_length, sepal_width maps to a feature_column array.

SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;

We need to extend the SQLFlow syntax to fully express the TF Transform logic. The core of the transform process is a function defining a Beam pipeline together with a customized preprocessing_fn. Both are user defined and very flexible, so we recommend writing them in a separate Python file and referencing it with a TRANSFORM keyword in the SQL expression. In the expression below, iris_transformer is the Python file name and transform is the function defining the Beam pipeline.

SELECT *
FROM iris.train
TO TRAIN DNNClassifier
WITH model.n_classes = 3, model.hidden_units = [10, 20]
TRANSFORM iris_transformer.transform
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL class
INTO sqlflow_models.my_dnn_model;
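One possible shape for the referenced iris_transformer.py is sketched below. This skeleton is hypothetical: a real implementation would build an Apache Beam pipeline and run the tft analyze-and-transform steps, which this dependency-free version only imitates on lists of dicts:

```python
# Hypothetical iris_transformer.py skeleton (names are illustrative).

def preprocessing_fn(inputs):
    # User-defined per-record logic, including inter-column calculation.
    return {
        'petal_ratio': inputs['petal_length'] / inputs['petal_width'],
        'class': inputs['class'],
    }

def transform(raw_records):
    """Entry point named by `TRANSFORM iris_transformer.transform`.
    A real version would build a Beam pipeline and also return a
    transform graph to export alongside the trained model."""
    return [preprocessing_fn(r) for r in raw_records]

rows = [{'petal_length': 4.0, 'petal_width': 2.0, 'class': 0}]
print(transform(rows))  # [{'petal_ratio': 2.0, 'class': 0}]
```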

Export Transform and Model Definition Together to SavedModel in TF2.0

In the official tutorial example, TF Transform is integrated with Estimator. When exporting the model as a SavedModel, we need to construct the serving_input_fn from tf_transform_output to define the inference signature. In TF 2.0, we use Keras to define the model, and the inference signature is generated automatically; models with feature columns work fine this way. It is not yet clear whether this also works well with TF Transform plus feature columns.

Open Questions

  1. The output columns of TF Transform are defined in Python code. How do we map them to the COLUMN expression in SQLFlow?