Feature
A feature in Glacius represents a transformation on an existing column in a data source. This makes it easy to experiment with and create new features without updating data pipelines.
Features in Glacius have two main components:
- Row-wise expression: Determines how each row's data is transformed based on certain conditions. See DSL for the full spec on row-wise expressions.
- Aggregation: Specifies how the transformed data is aggregated over a defined window of time. The currently supported aggregations are [SUM, AVG, MIN, MAX, LATEST], and we are in the process of adding more!
from datetime import timedelta
from glacius import Feature, Aggregation, AggregationType, when, col

Feature(
    name="total_items_clicked_category_5d",
    description="Total items clicked per category over 5 days",
    expr=when(col("product_category") == "fashion_apparel").then(col("item_click")).otherwise(0),
    dtype="Int32",
    agg=Aggregation(method=AggregationType.SUM, window=timedelta(days=5)),
)
This feature, named total_items_clicked_category_5d, calculates the total number of items clicked in a specific category over a 5-day period. The expr field uses a conditional expression to filter clicks by product category, counting clicks (item_click) if the product belongs to the specified category and ignoring them (counting as 0) otherwise. The aggregation is done by summing up these values over a 5-day window.
Note: the feature expression is always computed before the aggregation! For more on feature expressions, see the section below.
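To make the ordering concrete, here is a minimal plain-Python sketch that does not use Glacius itself. The sample rows, the `row_expr` and `aggregate_sum` helpers, and the choice of window boundaries (start exclusive, end inclusive) are all assumptions made for illustration; they only mimic the row-wise expression and the 5-day SUM from the example above.

```python
from datetime import datetime, timedelta

# Hypothetical sample rows; column names follow the example above.
rows = [
    {"ts": datetime(2024, 1, 1), "product_category": "fashion_apparel", "item_click": 3},
    {"ts": datetime(2024, 1, 2), "product_category": "electronics", "item_click": 7},
    {"ts": datetime(2024, 1, 4), "product_category": "fashion_apparel", "item_click": 2},
    {"ts": datetime(2023, 12, 20), "product_category": "fashion_apparel", "item_click": 9},
]


def row_expr(row):
    """Row-wise step: keep item_click for fashion_apparel rows, otherwise 0."""
    return row["item_click"] if row["product_category"] == "fashion_apparel" else 0


def aggregate_sum(rows, as_of, window):
    """Aggregation step: sum the transformed values inside (as_of - window, as_of]."""
    start = as_of - window
    # The expression is applied to each row first; only then are values summed.
    return sum(row_expr(r) for r in rows if start < r["ts"] <= as_of)


total = aggregate_sum(rows, as_of=datetime(2024, 1, 5), window=timedelta(days=5))
print(total)
```

The electronics row contributes 0 because the expression runs before the sum, and the December row is dropped because it falls outside the window.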
Feature Expressions in Glacius
Feature expressions in Glacius represent row-level transformations on data sources. All expressions compile down to SparkSQL. We use our own DSL instead of pure SparkSQL so that compile-time checks can enforce correctness. Catching errors at compile time rather than at runtime avoids spending resources on jobs that would fail.
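The general idea of checking an expression when it is built and then rendering it to SQL can be sketched with a toy example. This is not Glacius's implementation: the `Col`, `Lit`, and `Add` classes and the `to_sql` method are hypothetical names invented here, and the real DSL may work quite differently.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Col:
    """Hypothetical column reference."""
    name: str

    def to_sql(self) -> str:
        return f"`{self.name}`"


@dataclass(frozen=True)
class Lit:
    """Hypothetical numeric literal."""
    value: Union[int, float]

    def to_sql(self) -> str:
        return str(self.value)


@dataclass(frozen=True)
class Add:
    """Hypothetical addition node that validates its operands on construction."""
    left: object
    right: object

    def __post_init__(self):
        # Reject bad operands when the expression is built, before any SQL runs.
        for side in (self.left, self.right):
            if not isinstance(side, (Col, Lit, Add)):
                raise TypeError(f"unsupported operand: {side!r}")

    def to_sql(self) -> str:
        return f"({self.left.to_sql()} + {self.right.to_sql()})"


sql = Add(Col("price"), Col("tax")).to_sql()
print(sql)
```

In this sketch, `Add(Col("price"), "tax")` raises a `TypeError` as soon as the expression is constructed, which is the kind of early failure the paragraph above describes.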
Conditional Logic with when
Apply conditional logic to perform row-wise data transformations. Use when to specify the conditions under which a transformation should occur.
when(col("product_category") == "electronics").then(col("sales_amount")).otherwise(0)
Arithmetic Operations
Perform basic arithmetic operations on columns to calculate new values.
Addition
add(col("price"), col("tax"))
Subtraction
sub(col("total_price"), col("discount"))
Multiplication
mul(col("quantity"), col("unit_price"))
Division
div(col("total_sales"), col("number_of_items"))
Logical Operations
Combine conditions or expressions using logical operators.
AND (and_)
Combines multiple conditions that must all be true.
and_(col("age") > 18, col("subscriber") == True)
OR (or_)
Combines conditions where at least one must be true.
or_(col("state") == "CA", col("state") == "NY")
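The two operators can also be nested. The sketch below shows the intended row-level truth logic in plain Python; the local `and_` and `or_` functions are stand-ins written for this illustration, not the Glacius functions, and the sample rows are hypothetical.

```python
def and_(left: bool, right: bool) -> bool:
    """Stand-in for a two-argument AND: true only if both conditions hold."""
    return left and right


def or_(left: bool, right: bool) -> bool:
    """Stand-in for a two-argument OR: true if at least one condition holds."""
    return left or right


def is_target(row: dict) -> bool:
    """Adult subscribers located in CA or NY."""
    return and_(
        and_(row["age"] > 18, row["subscriber"] is True),
        or_(row["state"] == "CA", row["state"] == "NY"),
    )


# Hypothetical rows using the column names from the examples above.
rows = [
    {"age": 30, "subscriber": True, "state": "CA"},
    {"age": 30, "subscriber": True, "state": "TX"},
    {"age": 17, "subscriber": True, "state": "NY"},
]
matches = [is_target(r) for r in rows]
print(matches)
```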
String Manipulation
Concatenate or manipulate strings to create new string values.
Concatenation (concat)
Joins two or more strings or columns into a single string.
concat(col("first_name"), ' ', col("last_name"))
Date and Time Operations
Date Difference (date_diff)
date_diff(col("end_date"), col("start_date"))
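The source does not state the unit or sign convention of date_diff. Assuming it mirrors Spark SQL's datediff(endDate, startDate), which returns the number of days from the start date to the end date, the row-level result would match this plain-Python sketch. The `date_diff_days` helper is a hypothetical name used only here.

```python
from datetime import date


def date_diff_days(end_date: date, start_date: date) -> int:
    """Days from start_date to end_date (assumed semantics; negative if end precedes start)."""
    return (end_date - start_date).days


# 2024 is a leap year, so February has 29 days.
days = date_diff_days(date(2024, 3, 1), date(2024, 2, 1))
print(days)
```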