This document is an adaptation of the official TensorFlow Feature Columns guide.
This document details feature columns and how they can be used as inputs to neural networks using TensorFlow. Feature columns are very rich, enabling you to transform a diverse range of raw data into formats that neural networks can use, allowing easy experimentation.
What kind of data can a deep neural network operate on? The answer
is, of course, numbers (for example, tf$float32
). After
all, every neuron in a neural network performs multiplication and
addition operations on weights and input data. Real-life input data,
however, often contains non-numerical (categorical) data. For example,
consider a product_class
feature that can contain the
following three non-numerical values:
ML models generally represent categorical values as simple vectors in
which a 1 represents the presence of a value and a 0 represents the
absence of a value. For example, when product_class
is set
to sports, an ML model would usually represent
product_class
as [0, 0, 1]
, meaning:
So, although raw data can be numerical or categorical, an ML model represents all features as numbers.
This document explains nine of the feature columns available in
tfdatasets. As the following figure shows, all nine functions return
either a categorical_column
or a dense_column
object, except bucketized_column
, which inherits from both
classes:
Let’s look at these functions in more detail.
We are going to use the feature_spec
interface in the
examples. The feature_spec
interface is an abstraction that
makes it easier to define feature columns in R.
You can initialize a feature_spec
with:
library(tfdatasets)
<- tensor_slices_dataset(hearts)
hearts_dataset <- feature_spec(hearts_dataset, target ~ .) spec
We then add steps in order to define feature_columns
.
Read the feature_spec
vignette for more information.
We use step_numeric_column
to add numeric columns to our
spec
.
%>%
spec step_numeric_column(age)
By default, a numeric column creates a single value (scalar). Use the shape argument to specify another shape. For example:
# Represent a 10-element vector in which each cell contains a tf$float32.
%>%
spec step_numeric_column(bowling, shape = 10)
# Represent a 10x5 matrix in which each cell contains a tf$float32.
%>%
spec step_numeric_column(my_matrix, shape = c(10, 5))
A nice feature of step_numeric_column
is that you can
also specify normalizer functions. When using a scaler, the scaling
constants will be learned from data when fitting the
feature_spec
.
# use a function that defines tensorflow ops.
%>%
spec step_numeric_column(age, normalizer_fn = function(x) (x-10)/5)
# use a scaler
%>%
spec step_numeric_column(age, normalizer_fn = scaler_standard())
Often, you don’t want to feed a number directly into the model, but
instead split its value into different categories based on numerical
ranges. To do so, create a bucketized_column
. For example,
consider raw data that represents the year a house was built. Instead of
representing that year as a scalar numeric column, we could split the
year into the following four buckets:
Why would you want to split a number — a perfectly valid input to your model — into a categorical value? Well, notice that the categorization splits a single input number into a four-element vector. Therefore, the model now can learn four individual weights rather than just one; four weights creates a richer model than one weight. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). For example, when we just use a single number (a year) as input, a linear model can only learn a linear relationship. So, bucketing provides the model with additional flexibility that the model can use to learn.
The following code demonstrates how to create a bucketized feature:
# First, convert the raw input to a numeric column.
<- spec %>%
spec step_numeric_column(age)
# Then, bucketize the numeric column.
<- spec %>%
spec step_bucketized_column(age, boundaries = c(30, 50, 70))
Note that specifying a three-element boundaries vector creates a four-element bucketized vector.
Categorical identity columns are to tfdatasets
what
factors
are to R. Put differently, they can be seen as a
special case of bucketized columns. In traditional bucketized columns,
each bucket represents a range of values (for example, from 1960 to
1979). In a categorical identity column, each bucket represents a
single, unique integer. For example, let’s say you want to represent the
integer range [0, 4). That is, you want to represent the integers 0, 1,
2, or 3. In this case, the categorical identity mapping looks like
this:
As with bucketized columns, a model can learn a separate weight for
each class in a categorical identity column. For example, instead of
using a string to represent the product_class
, let’s
represent each class with a unique integer value. That is:
Call step_categorical_column_with_identity
to add a
categorical identity column to the feature_spec
. For
example:
# Create categorical output for an integer feature named "my_feature_b",
# The values of my_feature_b must be >= 0 and < num_buckets
<- spec %>%
spec step_categorical_column_with_identity(my_feature_b, num_buckets = 4)
We cannot input strings directly to a model. Instead, we must first map strings to numeric or categorical values. Categorical vocabulary columns provide a good way to represent strings as a one-hot vector. For example:
As you can see, categorical vocabulary columns are kind of an enum
version of categorical identity columns. tfdatasets
provides two different functions to create categorical vocabulary
columns:
step_categorical_column_with_vocabulary_list
step_categorical_column_with_vocabulary_file
categorical_column_with_vocabulary_list
maps each string
to an integer based on an explicit vocabulary list. For example:
<- spec %>%
spec step_categorical_column_with_vocabulary_list(
thal, vocabulary_list = c("fixed", "normal", "reversible")
)
Note that the vocabulary_list
argument is optional in R
and the vocabulary will be discovered when fitting the
featture_spec
which saves us a lot of typing.
You can also place the vocabulary in a separate file that should contain one line for each vocabulary element. You can then use:
<- spec %>%
spec step_categorical_column_with_vocabulary_file(thal, vocabulary_file = "thal.txt")
So far, we’ve worked with a naively small number of categories. For
example, our product_class example has only 3 categories. Often though,
the number of categories can be so big that it’s not possible to have
individual categories for each vocabulary word or integer because that
would consume too much memory. For these cases, we can instead turn the
question around and ask, “How many categories am I willing to have for
my input?” In fact, the
step_categorical_column_with_hash_bucket
function enables
you to specify the number of categories. For this type of feature column
the model calculates a hash value of the input, then puts it into one of
the hash_bucket_size categories using the modulo operator, as in the
following pseudocode:
# pseudocode
feature_id = hash(raw_feature) % hash_bucket_size
The code to add the feature_column
to the
feature_spec
might look something like this:
<- spec %>%
spec step_categorical_column_with_hash_bucket(thal, hash_bucket_size = 100)
At this point, you might rightfully think: “This is crazy!” After all, we are forcing the different input values to a smaller set of categories. This means that two probably unrelated inputs will be mapped to the same category, and consequently mean the same thing to the neural network. The following figure illustrates this dilemma, showing that kitchenware and sports both get assigned to category (hash bucket) 12:
As with many counterintuitive phenomena in machine learning, it turns out that hashing often works well in practice. That’s because hash categories provide the model with some separation. The model can use additional features to further separate kitchenware from sports.
Combining features into a single feature, better known as feature crosses, enables the model to learn separate weights for each combination of features.
More concretely, suppose we want our model to calculate real estate prices in Atlanta, GA. Real-estate prices within this city vary greatly depending on location. Representing latitude and longitude as separate features isn’t very useful in identifying real-estate location dependencies; however, crossing latitude and longitude into a single feature can pinpoint locations. Suppose we represent Atlanta as a grid of 100x100 rectangular sections, identifying each of the 10,000 sections by a feature cross of latitude and longitude. This feature cross enables the model to train on pricing conditions related to each individual section, which is a much stronger signal than latitude and longitude alone.
The following figure shows our plan, with the latitude & longitude values for the corners of the city in red text:
For the solution, we used a combination of the bucketized_column we
looked at earlier, with the step_crossed_column
function.
<- feature_spec(dataset, target ~ latitute + longitude) %>%
spec step_numeric_column(latitude, longitude) %>%
step_bucketized_column(latitude, boundaries = c(latitude_edges)) %>%
step_bucketized_column(longitude, boundaries = c(longitude_edges)) %>%
step_crossed_column(latitude_longitude = c(latitude, longitude), hash_bucket_size = 100)
You may create a feature cross from either of the following:
categorical_column_with_hash_bucket
(since crossed_column
hashes the input).When the feature columns latitude_bucket_fc
and
longitude_bucket_fc
are crossed, TensorFlow will create
(latitude_fc, longitude_fc)
pairs for each example. This
would produce a full grid of possibilities as follows:
(0,0), (0,1)... (0,99)
(1,0), (1,1)... (1,99)
... ... ...
(99,0), (99,1)...(99, 99)
Except that a full grid would only be tractable for inputs with
limited vocabularies. Instead of building this, potentially huge, table
of inputs, the crossed_column
only builds the number
requested by the hash_bucket_size argument
. The feature
column assigns an example to a index by running a hash function on the
tuple of inputs, followed by a modulo operation with
hash_bucket_size
.
As discussed earlier, performing the hash and modulo function limits the number of categories, but can cause category collisions; that is, multiple (latitude, longitude) feature crosses will end up in the same hash bucket. In practice though, performing feature crosses still adds significant value to the learning capability of your models.
Somewhat counterintuitively, when creating feature crosses, you typically still should include the original (uncrossed) features in your model (as in the preceding code snippet). The independent latitude and longitude features help the model distinguish between examples where a hash collision has occurred in the crossed feature.
Indicator columns and embedding columns never work on features directly, but instead take categorical columns as input.
When using an indicator column, we’re telling TensorFlow to do exactly what we’ve seen in our categorical product_class example. That is, an indicator column treats each category as an element in a one-hot vector, where the matching category has value 1 and the rest have 0s:
Here’s how you create an indicator column:
<- feature_spec(dataset, target ~ .) %>%
spec step_categorical_column_with_vocabulary_list(product_class) %>%
step_indicator_column(product_class)
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
Let’s look at an example comparing indicator and embedding columns. Suppose our input examples consist of different words from a limited palette of only 81 words. Further suppose that the data set provides the following input words in 4 separate examples:
In that case, the following figure illustrates the processing path for embedding columns or indicator columns.
When an example is processed, one of the categorical_column_with_[…] functions maps the example string to a numerical categorical value. For example, a function maps “spoon” to [32]. (The 32 comes from our imagination — the actual values depend on the mapping function.) You may then represent these numerical categorical values in either of the following two ways:
As an indicator column. A function converts each numeric categorical value into an 81-element vector (because our palette consists of 81 words), placing a 1 in the index of the categorical value (0, 32, 79, 80) and a 0 in all the other positions.
As an embedding column. A function uses the numerical categorical values (0, 32, 79, 80) as indices to a lookup table. Each slot in that lookup table contains a 3-element vector.
How do the values in the embeddings vectors magically get assigned? Actually, the assignments happen during training. That is, the model learns the best way to map your input numeric categorical values to the embeddings vector value in order to solve your problem. Embedding columns increase your model’s capabilities, since an embeddings vector learns new relationships between categories from the training data.
Why is the embedding vector size 3 in our example? Well, the following “formula” provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25
Note: This is just a general guideline; you can set the number of embedding dimensions as you please.
Call step_embedding_column
to create an embedding_column
as suggested by the following snippet:
<- feature_spec(dataset, target ~ .) %>%
spec step_categorical_column_with_vocabulary_list(product_class) %>%
step_embedding_column(product_class, dimension = 3)
Embeddings is a significant topic within machine learning. This information was just to get you started using them as feature columns.
After creating and fitting a feature_spec
object you can
use the dense_features
to get the Dense features from the
specifications. You can then use the layer_dense_features
function from Keras to create a layer that will automatically initialize
values.
Here is an an example of how it would work:
library(keras)
library(dplyr)
<- feature_spec(hearts, target ~ .) %>%
spec step_numeric_column(
all_numeric(), -cp, -restecg, -exang, -sex, -fbs,
normalizer_fn = scaler_standard()
%>%
) step_categorical_column_with_vocabulary_list(thal) %>%
step_bucketized_column(age, boundaries = c(18, 25, 30, 35, 40, 45, 50, 55, 60, 65)) %>%
step_indicator_column(thal) %>%
step_embedding_column(thal, dimension = 2) %>%
step_crossed_column(c(thal, bucketized_age), hash_bucket_size = 10) %>%
step_indicator_column(crossed_thal_bucketized_age)
<- fit(spec)
spec
<- layer_input_from_dataset(hearts %>% select(-target))
input <- input %>%
output layer_dense_features(feature_columns = dense_features(spec)) %>%
layer_dense(units = 1, activation = "sigmoid")
<- keras_model(input, output)
model
%>%
model compile(
loss = "binary_crossentropy",
optimizer = "adam",
metrics = "accuracy"
)
%>%
model fit(
x = hearts %>% select(-target), y = hearts$target,
validation_split = 0.2
)