Feature columns are used to specify how Tensors received from the input function should be combined and transformed before entering the model. A feature column can be a plain mapping to some input column (e.g. column_numeric() for a column of numerical data), or a transformation of other feature columns (e.g. column_crossed() to define a new column as the cross of two other feature columns).
The following feature columns are available:
Feature Column | Description |
column_categorical_with_vocabulary_list() | Construct a Categorical Column with In-Memory Vocabulary. |
column_categorical_with_vocabulary_file() | Construct a Categorical Column with a Vocabulary File. |
column_categorical_with_identity() | Construct a Categorical Column that Returns Identity Values. |
column_categorical_with_hash_bucket() | Represents Sparse Feature where IDs are set by Hashing. |
column_categorical_weighted() | Construct a Weighted Categorical Column. |
column_indicator() | Represents Multi-Hot Representation of Given Categorical Column. |
column_numeric() | Construct a Real-Valued Column. |
column_embedding() | Construct a Dense Column. |
column_crossed() | Construct a Crossed Column. |
column_bucketized() | Construct a Bucketized Column. |
Some typical mappings of R data types to feature columns are:
Data Type | Feature Column |
Numeric | column_numeric() |
Factor | column_categorical_with_identity() |
Character | column_categorical_with_hash_bucket() |
We’ll use the flights dataset from the nycflights13 package to explore how feature columns can be constructed. The flights dataset records airline on-time data for all flights departing NYC in 2013.
> print(flights)
# A tibble: 336,776 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr> <chr> <chr> <dbl>
1 2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227
2 2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227
3 2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
For example, we can define numeric columns based on the dep_time and dep_delay variables.
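A minimal sketch of those two definitions, assuming the tfestimators package is installed with a working TensorFlow backend. The variable name cols is only illustrative:

```r
library(tfestimators)

# One numeric feature column per input variable, referenced by name.
# Only the column definitions are created here; no data is read yet.
cols <- feature_columns(
  column_numeric("dep_time"),
  column_numeric("dep_delay")
)
```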
You can also define multiple feature columns at once.
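As a sketch of the shorter form, assuming column_numeric() accepts several variable names in one call and that tfestimators is installed with a working TensorFlow backend:

```r
library(tfestimators)

# Assumption: passing several names yields one numeric feature column
# per name, equivalent to calling column_numeric() once for each.
cols <- feature_columns(
  column_numeric("dep_time", "dep_delay")
)
```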
Often, you will want to generate a number of feature column definitions based on some pattern in the variable names of your data set. tfestimators uses the tidyselect package to make it easy to define feature columns, similar to what you might be familiar with from the dplyr package. You can use the names argument of the feature_columns() function to define a context from which variable names will be selected.
For example, we can use the ends_with() helper to declare that all columns ending with "time" are numeric columns.
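The pattern-based definition described above might look like the following sketch. It assumes tfestimators (with a working TensorFlow backend), tidyselect and nycflights13 are installed; tidyselect is attached explicitly so that ends_with() is found whether or not tfestimators re-exports it.

```r
library(tfestimators)
library(tidyselect)
library(nycflights13)

# names = flights supplies the variable names to match against.
# In the columns printed above, ends_with("time") matches dep_time,
# sched_dep_time, arr_time, sched_arr_time and air_time.
cols <- feature_columns(
  names = flights,
  column_numeric(ends_with("time"))
)
```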
The names parameter can either be a character vector with the names as-is, or any named R object.
If the code you use to compose columns is more complicated, or if you need to save references to columns for use in column embeddings, you can also establish a scope for a given set of column names using the with_columns() function.
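A sketch of the scoped form, assuming with_columns() takes the named object first and an expression to evaluate within that scope second. The same installation assumptions as for the other examples apply.

```r
library(tfestimators)
library(tidyselect)
library(nycflights13)

# Every column definition inside the braces resolves variable names
# against flights, so names = does not have to be repeated.
cols <- with_columns(flights, {
  feature_columns(
    column_numeric(ends_with("time"))
  )
})
```

Because the block is ordinary R code, intermediate columns can be assigned to variables inside it and reused, for example when building embeddings.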
You can also use an alternate syntax of the form (pattern) ~ (column), which can add clarity when longer pattern rules are used, as it separates the matching rule from the column definition.
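The formula form could be written as in the sketch below, where the left-hand side is the matching rule and the right-hand side is a column constructor called without names. It assumes the same packages as the earlier examples.

```r
library(tfestimators)
library(tidyselect)
library(nycflights13)

# pattern ~ column: every variable matched on the left becomes a
# numeric feature column built by the constructor on the right.
cols <- feature_columns(
  names = flights,
  ends_with("time") ~ column_numeric()
)
```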
Available pattern matching operators include:
Operator | Description |
starts_with() | Starts with a prefix |
ends_with() | Ends with a suffix |
contains() | Contains a literal string |
matches() | Matches a regular expression |
one_of() | Included in character vector |
everything() | All columns |
See help("select_helpers", package = "tidyselect") for full information on the set of helpers made available by the tidyselect package.
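To see which names a given helper would pick before using it in a feature column definition, the helpers can be tried with tidyselect alone. The sketch below assumes tidyselect 1.0 or later (for eval_select()). The vector flight_vars and the function pick() are hypothetical names introduced for illustration, and the variable list is a hand-copied subset of the flights columns.

```r
library(tidyselect)

# Hypothetical subset of the flights variable names.
flight_vars <- c("year", "month", "dep_time", "sched_dep_time", "dep_delay",
                 "arr_time", "air_time", "carrier", "time_hour")

# eval_select() needs a named object; a named list stands in for the data.
vars <- as.list(stats::setNames(seq_along(flight_vars), flight_vars))

# Return the names selected by a quoted tidyselect expression.
pick <- function(expr) names(eval_select(expr, vars))

time_cols   <- pick(quote(ends_with("time")))
dep_prefix  <- pick(quote(starts_with("dep")))
dep_any     <- pick(quote(contains("dep")))
dep_or_arr  <- pick(quote(matches("^(dep|arr)_")))

print(time_cols)
# "dep_time" "sched_dep_time" "arr_time" "air_time"
```

Note that time_hour is not selected by ends_with("time"), since the match is on the suffix only.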