Offline Features
To build offline features for training, you'll need a label datasource and the list of feature names you want to compute.
The label datasource is the spine of the dataset. It includes the events you're interested in (for example, click events or item-bought events) and the timestamp of when each event occurred. The timestamp is crucial: it lets Glacius perform a point-in-time join, computing what the feature values were for a given entity at that exact moment, without leaking information from the future.
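To make the point-in-time join concrete, here is a minimal sketch using pandas (not the Glacius internals; the table and column names are hypothetical). For each label row, we pick the most recent feature value at or before the event timestamp, never a future one:

```python
import pandas as pd

# Toy spine: labeled click events per user, with event timestamps.
spine = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-01-10"]),
    "label": [1, 0, 1],
})

# Toy feature table: feature values as they changed over time.
features = pd.DataFrame({
    "user_id": [1, 1, 2],
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-15", "2024-01-01"]),
    "lifetime_clicks": [10, 25, 3],
})

# merge_asof requires both frames sorted on the join key.
spine = spine.sort_values("timestamp")
features = features.sort_values("timestamp")

# Backward direction = "as of" the event time: each label row gets the
# latest feature value whose timestamp is <= the event timestamp.
training = pd.merge_asof(
    spine, features, on="timestamp", by="user_id", direction="backward"
)
print(training["lifetime_clicks"].tolist())  # [10, 3, 25]
```

Note that user 1's event on 2024-01-20 picks up the updated value 25, while the earlier event on 2024-01-05 still sees 10.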
1. Define the label data source.
label_datasource = SnowflakeSource(
    name="user click data snowflake",
    description="user click data in snowflake",
    timestamp_col="timestamp",
    table="OBSERVATIONS_BENCHMARK_M",
    database="user_click_data",
    schema="public",
)
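For reference, the rows in such a table might look like the toy frame below (column names are illustrative, not the real schema of `OBSERVATIONS_BENCHMARK_M`): one row per labeled event, keyed by an entity id, with the column named in `timestamp_col` holding the event time.

```python
import pandas as pd

# Hypothetical label rows: entity key, event timestamp, and label.
labels = pd.DataFrame({
    "user_id": [101, 102],
    "timestamp": pd.to_datetime(["2024-03-01 12:00", "2024-03-02 08:30"]),
    "clicked": [1, 0],
})

# The column referenced by timestamp_col must be a real timestamp type,
# since the point-in-time join orders and filters on it.
assert pd.api.types.is_datetime64_any_dtype(labels["timestamp"])
```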
Triggering Offline Job Via the Registry
If you are triggering jobs via the registry, you'll need to specify which namespace version to use; by default, the latest version of the namespace is used. Pinning an explicit version keeps production pipelines backwards compatible even as the namespace evolves.
job = client.get_offline_features(
    feature_names=[f.name for f in user_bundle.features],
    labels_datasource=label_datasource,
    output_path="s3://my-s3-bucket/offline_features_test_job",
    namespace_version="latest",
)
Triggering Ad Hoc Offline Job
If you'd rather trigger the job from the feature bundles you've just defined in a notebook, you can pass them in directly for ad hoc runs.
job = client.get_offline_features(
    feature_bundles=[user_bundle],
    labels_datasource=label_datasource,
    output_path="s3://my-s3-bucket/offline_features_test_job",
)