FeatureHub Tutorial

In this tutorial, we will go through the functionality offered by FeatureHub, a cloud platform for feature engineering.

What is feature engineering? Feature engineering is the step in the data science pipeline in which raw variables are transformed to create features ready for inclusion in a machine learning model. This is a critical step in a typical data science pipeline and can be one of the most challenging aspects of a prediction task. Indeed, human intuition and domain expertise often play a key role in the generation of high-quality features.
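For instance, a raw timestamp is rarely useful to a model as-is, but quantities derived from it often are. Here is a minimal illustration (the table and column names are invented for this example; they are not part of the FeatureHub demo data):

import pandas as pd

# hypothetical raw table -- column names invented for this illustration
users = pd.DataFrame({
    "signup_date": pd.to_datetime(["2015-01-01", "2016-06-15"]),
    "age": [34, 27],
})

# engineered features derived from the raw variables
account_age_days = (pd.Timestamp("2017-01-01") - users["signup_date"]).dt.days
is_under_30 = users["age"] < 30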

Prepare your session

We import commands, the FeatureHub client, into our workspace. We’ll use this client to acquire data, evaluate our features, and register them with the Feature database.

In [ ]:
from featurehub.problems.demo import commands

Acquire dataset

We use get_sample_dataset to load the training data into our workspace. The result is a tuple where the first element is dataset, a dictionary mapping table names to Pandas DataFrames, and the second element is target, a Pandas DataFrame with one column. This one column is what we are trying to predict.

In [ ]:
dataset, target = commands.get_sample_dataset()

We can explore our data inline in the Notebook.

In [ ]:
list(dataset.keys())
In [ ]:
dataset["users"].head()
In [ ]:
dataset["groups"].head()
In [ ]:
target.head()

Explore existing features

We can use several methods to see what features have already been registered to the Feature database.

The first method, print_my_features, prints the features that you have registered to the database.

In [ ]:
commands.print_my_features()

We can also pass additional parameters to print_my_features to filter by code fragments or to see a certain type of metric, if available.

In [ ]:
commands.print_my_features(code_fragment="""dataset["users"]["name"]""")
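We can also filter by metric. (The metric name used here, Accuracy, is only illustrative; the metrics available depend on the problem.)

In [ ]:
commands.print_my_features(metric_name="Accuracy")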

To see detailed documentation, try using Jupyter Notebook’s built-in documentation system by appending a ? to the end of a method name.

In [ ]:
commands.print_my_features?

The second method, discover_features, prints features that other participants have registered to the Feature database. This allows you to discover code that has already been written, so you can either avoid duplicating work or come up with new ideas.

In [ ]:
commands.discover_features()

The same additional parameters, code_fragment and metric_name, are available.

In [ ]:
commands.discover_features(code_fragment="""fillna(""")

Write a new feature

FeatureHub asks you to observe some rudimentary scaffolding when you write a new feature.

Your feature is a function that should

  • ✓ accept a single parameter, dataset

  • ✓ return a single column of values

    • that has as many rows as there are entities in the dataset

    • that is ordered in the same way as the entities (that is, don’t sort your feature values!)

    • that can be coerced to a DataFrame

  • ✓ be defined in the global scope

  • ✓ import all modules that it requires within the function body

Your feature should not

  • ✗ modify the underlying dataset

  • ✗ use other variables or external module members defined at the global scope (see below)

Skip below to read more about good and bad features in FeatureHub.

Here is one good example of a feature. You can execute the feature right away to see what it returns.

In [ ]:
def hi_lo_age(dataset):
    from sklearn.preprocessing import binarize
    cutoff = 30
    return binarize(dataset["users"]["age"].values.reshape(-1, 1), threshold=cutoff)

hi_lo_age(dataset)

Evaluate a feature on training data

Now that we have written a candidate feature, we can evaluate it on training data. The evaluation routine proceeds as follows.

  1. Obtains a valid dataset. That is, if the dataset has been modified, it is reloaded.
  2. Extracts features. That is, your function is called with the dataset as its parameter, returning a column of values.
  3. Verifies the integrity of the dataset, in that it was not changed by executing the feature.
  4. Validates feature values, to ensure they meet the requirements listed above (a rough local version of this check is sketched after this list).
  5. Builds full feature matrix, by combining extracted feature values with pre-processed entity features.
  6. Fits model and computes metrics. Given the task (classification, regression, etc.), a model is chosen and fit given the full feature matrix. Then, appropriate metrics are computed via cross-validation and displayed.
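Before calling evaluate, you can run a rough local sanity check along the lines of steps 3 and 4. The following is only a sketch of the idea, not FeatureHub’s actual validation code, and it assumes that users is the entity table, as in this demo problem:

In [ ]:
import pandas as pd

def sanity_check(feature, dataset):
    # a rough local approximation of FeatureHub's validation -- not the real thing
    values = feature(dataset)
    df = pd.DataFrame(values)  # must be coercible to a DataFrame
    assert df.shape[1] == 1, "feature must return a single column of values"
    assert len(df) == len(dataset["users"]), "one value per entity, in the original order"

sanity_check(hi_lo_age, dataset)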

In your workflow, you may run the evaluate function several times. At first, it may reveal bugs or syntax errors that you will fix. Next, it may reveal that your feature did not meet some of the FeatureHub requirements, such as returning a single column of values or using function-scope imports. Finally, you may find that your feature’s performance, in terms of metrics like classification accuracy or mean squared error, is not as good as you hoped, and you may modify the feature or jettison it altogether.

The evaluate function takes a single argument: the candidate feature.

In [ ]:
commands.evaluate(hi_lo_age)

Submit a feature to Feature database

Now that you have evaluated your feature locally on training data, and are happy with its performance, you can submit it to the Feature Evaluation Server. The evaluation server will essentially repeat the steps in evaluate, with some slight changes. For example, it fits the model on the training dataset and evaluates it on the test dataset, without performing cross-validation.

During submission, you are asked to write a natural language description (in English) of your feature. Imagine that you are explaining your code to a non-technical colleague. This description should be

  • clear
  • concise
  • informative to a domain expert who is not a data scientist
  • accurate (in that your description matches what the code actually does)

You will be prompted to type a description into a textbox. Alternatively, you can pass a string using the keyword argument description.

If there are no issues, the feature and its associated performance metrics are added to the Feature database (“registered”).

The feature may also be posted automatically to the FeatureHub forum (not applicable for this tutorial). A link will be printed, and you can navigate directly to your post to join the conversation.

In [ ]:
commands.submit(hi_lo_age)
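To skip the interactive prompt, pass the description directly (the description text here is just an example):

In [ ]:
commands.submit(hi_lo_age,
                description="Whether the user is older than 30 years of age")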

Final notes

As briefly mentioned in the section on feature evaluation, FeatureHub may combine your feature values with preprocessed entity-level features, if applicable to the problem. The problem description details the transformations done during preprocessing. You can also inspect the entity-level features directly.

In [ ]:
entity_features = commands.get_entity_features()
entity_features.head()

More feature scaffolding

Let’s take a closer look at how good and bad features are written in FeatureHub. Here, we are just talking about the scaffolding of the feature, and not whether the feature has predictive power in a machine learning model or any semantic meaning.

Here are some examples of good and bad features, with explanations:

# good - one parameter, imports numpy within function scope,
#        returns column of values of the right shape.
def all_zeros(dataset):
    from numpy import zeros
    n = len(dataset["users"])
    return zeros((n,1))

# bad - wrong number of parameters
def two_parameters(users, groups):
    return users["age"]

# bad - return value cannot be coerced to DataFrame
def scalar_zero(dataset):
    return 0

# bad - return value is not correct shape
def row_of_zeros(dataset):
    from numpy import zeros
    return zeros((20,20))

# bad - modifies underlying dataset!
def modify_dataset(dataset):
    dataset["users"].iloc[0,0] += 1
    return None
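If a transformation seems to require changing a table, operate on a copy instead, so the underlying dataset stays intact. For example:

# good - derives new values from a copy, leaving the dataset untouched
def age_next_year(dataset):
    ages = dataset["users"]["age"].copy()
    return ages + 1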

Don’t use global variables

Your feature will be evaluated by FeatureHub in an isolated namespace. This means that your feature cannot expect variables or modules that you have defined at the global scope to exist. By “global scope”, we mean variables or modules that are defined outside of your function definition.

Similarly, all module imports should be done within your function. In the next section, we’ll demonstrate an easy-to-use workflow that gets around this limitation.

This feature is invalid, because it uses a variable, cutoff, defined at the global scope (outside of the function definition):

In [ ]:
# bad
cutoff = 30
def hi_lo_age(dataset):
    from sklearn.preprocessing import binarize
    return binarize(dataset["users"]["age"].values.reshape(-1, 1), threshold=cutoff)

This feature is okay, because it moves the variable into the function definition:

In [ ]:
# better
def hi_lo_age(dataset):
    from sklearn.preprocessing import binarize
    cutoff = 30
    return binarize(dataset["users"]["age"].values.reshape(-1, 1), threshold=cutoff)

One exception to this requirement is that you can use helper functions that you define at the global scope:

In [ ]:
def first_name_is_longer(name):
    first, last = name.split(" ")
    return len(first) > len(last)

# okay
def long_first_name(dataset):
    return dataset["users"]["name"].apply(first_name_is_longer)

Use a helper function for imports

You might want to import the same set of modules in many different features without re-typing them. You can use a helper function to import them, while still avoiding using global variables.

In the following example, we define a function imports that declares pd and np as global variables and then imports the corresponding libraries. This function can then be called in any feature to make those libraries available.

In [ ]:
def imports():
    global pd, np
    import pandas as pd
    import numpy as np

def age_with_random_noise(dataset):
    imports()

    n = len(dataset["users"])
    noise = np.random.rand(n)
    # add noise to the age column only, so that a single column is returned
    return dataset["users"]["age"] + noise
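As with any other feature, you can evaluate it locally before submitting:

In [ ]:
commands.evaluate(age_with_random_noise)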

Getting help

If you need further help, there are several resources available.

  • Check out the FeatureHub User Guide
  • Check out the FeatureHub FAQ
  • Ask for help on the FeatureHub forum