In this document we will demonstrate the basic usage of the `feature_spec` interface in `tfdatasets`.

The `feature_spec` interface is a user-friendly interface to `feature_columns`. It allows us to specify column transformations and representations when working with structured data.

We will use the `hearts` dataset, which can be loaded with `data(hearts)`.
library(tfdatasets)
library(dplyr)
data(hearts)
head(hearts)
We want to train a model to predict the `target` variable using Keras, but before that we need to prepare the data. We need to transform the categorical variables into some form of dense representation, and we usually want to normalize all numeric columns too.

The feature spec interface works with `data.frame`s or TensorFlow dataset objects.
ids_train <- sample.int(nrow(hearts), size = 0.75 * nrow(hearts))
hearts_train <- hearts[ids_train, ]
hearts_test <- hearts[-ids_train, ]
Now let’s start creating our feature specification:
spec <- feature_spec(hearts_train, target ~ .)
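Since the interface also accepts TensorFlow dataset objects, the same specification could be built from one. A minimal sketch, assuming we first wrap the training data with `tensor_slices_dataset()` from tfdatasets:

# Sketch only: build the spec from a TensorFlow dataset instead of a
# data.frame. tensor_slices_dataset() turns the data frame into a
# dataset whose elements are its rows.
hearts_ds <- tensor_slices_dataset(hearts_train)
spec_ds <- feature_spec(hearts_ds, target ~ .)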
The first thing we need to do after creating the `feature_spec` is decide on the variables' types. We can do this by adding steps to the `spec` object.
spec <- spec %>%
  step_numeric_column(
    all_numeric(), -cp, -restecg, -exang, -sex, -fbs,
    normalizer_fn = scaler_standard()
  ) %>%
  step_categorical_column_with_vocabulary_list(thal)
The following steps can be used to define the variable type:

- `step_numeric_column` to define numeric variables
- `step_categorical_column_with_vocabulary_list` for categorical variables with a fixed vocabulary
- `step_categorical_column_with_hash_bucket` for categorical variables using the hash trick
- `step_categorical_column_with_identity` to store categorical variables as integers
- `step_categorical_column_with_vocabulary_file` when you have the possible vocabulary in a file

When using `step_categorical_column_with_vocabulary_list` you can also provide a `vocabulary` argument with the fixed vocabulary. If you don't, the recipe will find all the unique values in the dataset and use them as the vocabulary.
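For illustration, two of the alternatives above might look like the following. This is a sketch, not part of the recipe we are building; the argument names mirror the underlying TensorFlow feature columns, and the `thal` levels shown are illustrative.

# Sketch of alternative categorical steps (not added to our spec).

# Hash trick: map thal into a fixed number of buckets, no vocabulary needed.
spec %>%
  step_categorical_column_with_hash_bucket(thal, hash_bucket_size = 100)

# Fixed vocabulary: provide the levels up front instead of letting fit()
# discover them (levels here are illustrative).
spec %>%
  step_categorical_column_with_vocabulary_list(
    thal,
    vocabulary_list = c("fixed", "normal", "reversible")
  )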
You can also specify a `normalizer_fn` to `step_numeric_column`. In this case the variable will be transformed by the feature column. Note that the transformation will occur in the TensorFlow graph, so it must use only TensorFlow ops. As in the example, we offer pre-made normalizers, and they will compute the normalizing function during the recipe preparation.
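Because it runs inside the TensorFlow graph, a hand-written `normalizer_fn` has to stick to operations that translate to TensorFlow ops; plain arithmetic does. A minimal sketch with made-up constants (in practice `scaler_standard()` computes the mean and standard deviation for you during fitting):

# Sketch: a custom normalizer. Subtraction and division become
# TensorFlow ops, so this is graph-safe; 54 and 9 are illustrative.
spec %>%
  step_numeric_column(
    age,
    normalizer_fn = function(x) (x - 54) / 9
  )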
You can also use selectors like:

- `starts_with()`, `ends_with()`, `matches()` etc. (from tidyselect)
- `all_numeric()` to select all numeric variables
- `all_nominal()` to select all strings
- `has_type("float32")` to select based on the TensorFlow variable type
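As a sketch (not evaluated here), selectors could replace the explicit column names used above:

# Sketch: select every string column at once instead of naming thal,
# or select numeric columns by their TensorFlow type.
spec %>%
  step_categorical_column_with_vocabulary_list(all_nominal())

spec %>%
  step_numeric_column(has_type("float32"))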
Now we can print the recipe:
spec
After specifying the types of the columns you can add transformation steps. For example, you may want to bucketize a numeric column:
spec <- spec %>%
  step_bucketized_column(age, boundaries = c(18, 25, 30, 35, 40, 45, 50, 55, 60, 65))
You can also specify the kind of numeric representation that you want to use for your categorical variables.
spec <- spec %>%
  step_indicator_column(thal) %>%
  step_embedding_column(thal, dimension = 2)
Another common transformation is to add interactions between variables using crossed columns.
spec <- spec %>%
  step_crossed_column(thal_and_age = c(thal, bucketized_age), hash_bucket_size = 1000) %>%
  step_indicator_column(thal_and_age)
Note that the `crossed_column` is a categorical column, so we also need to specify what kind of numeric transformation we want to use. Also note that we can name the transformed variables - each step uses a default naming for columns, e.g. `bucketized_age` is the default name when you use `step_bucketized_column` with a column called `age`.
With the above code we have created our recipe. Note that we can also define the recipe by chaining a sequence of methods:
spec <- feature_spec(hearts_train, target ~ .) %>%
  step_numeric_column(
    all_numeric(), -cp, -restecg, -exang, -sex, -fbs,
    normalizer_fn = scaler_standard()
  ) %>%
  step_categorical_column_with_vocabulary_list(thal) %>%
  step_bucketized_column(age, boundaries = c(18, 25, 30, 35, 40, 45, 50, 55, 60, 65)) %>%
  step_indicator_column(thal) %>%
  step_embedding_column(thal, dimension = 2) %>%
  step_crossed_column(c(thal, bucketized_age), hash_bucket_size = 10) %>%
  step_indicator_column(crossed_thal_bucketized_age)
After defining the recipe we need to `fit` it. It's when fitting that we compute the vocabulary list for categorical variables or find the mean and standard deviation for the normalizing functions. Fitting involves evaluating the full dataset, so if you have provided the vocabulary list and your columns are already normalized you can skip the fitting step (TODO).
In our case, we will fit the feature spec, since we didn’t specify the vocabulary list for the categorical variables.
spec_prep <- fit(spec)
After preparing we can see the list of dense features that were defined:
str(spec_prep$dense_features())
Now we are ready to define our model in Keras. We will use a specialized `layer_dense_features` that knows what to do with the feature columns specification.

We also use a new `layer_input_from_dataset` that is useful to create a Keras input object copying the structure from a `data.frame` or TensorFlow dataset.
library(keras)

input <- layer_input_from_dataset(hearts_train %>% select(-target))

output <- input %>%
  layer_dense_features(dense_features(spec_prep)) %>%
  layer_dense(units = 32, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(input, output)

model %>% compile(
  loss = loss_binary_crossentropy,
  optimizer = "adam",
  metrics = "binary_accuracy"
)
We can finally train the model on the dataset:
history <- model %>%
  fit(
    x = hearts_train %>% select(-target),
    y = hearts_train$target,
    epochs = 15,
    validation_split = 0.2
  )
plot(history)
Finally we can make predictions on the test set and calculate performance metrics like the AUC of the ROC curve:
hearts_test$pred <- predict(model, hearts_test %>% select(-target))
Metrics::auc(hearts_test$target, hearts_test$pred)