Filtering and Row Selection
This section covers transformations that project columns, filter rows, deduplicate, limit, and sample datasets.
Select (8.1)
Projects specific columns or expressions from a dataset. The output contains only the listed columns, in the order specified.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
columns | NonEmptyList[Expression] | Yes | Column names or expressions to include. |
Each entry in columns can be a bare column name or an expression, optionally aliased with as.
Example — basic projection with computed columns:
transformation:
- name: projected
select:
from: employees
columns:
- id
- name
- "salary * 1.1 as adjusted_salary"
- "upper(department) as dept"
Tip: If a column name does not exist in the source, the runtime raises
E-COL-001. Double-check your source schema.
Where / Filter (8.2)
Filters rows based on a boolean condition. Only rows where the condition evaluates to true are kept.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
filter | Condition | Yes | Boolean expression for row selection. |
Rows where filter evaluates to NULL are excluded — NULL is not truthy.
Example — filtering active users by date:
transformation:
- name: activeUsers
where:
from: users
filter: "status = 'active' AND created_at >= '2025-01-01'"
Example — numeric filter:
transformation:
- name: highValueOrders
where:
from: orders
filter: "amount > 500"
Distinct (8.9)
Removes duplicate rows from a dataset.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
from | AssetRef | Yes | — | Source asset. |
columns | List[Column] | No | all columns | Subset of columns for deduplication. |
When columns is omitted, all columns are used. When specified, one arbitrary row per group is kept — there is no guarantee which row is retained.
Tip: If you need deterministic deduplication, use
orderByfollowed by awindowfunction withrow_number()and then filter.
Example — full row deduplication:
transformation:
- name: uniqueEvents
distinct:
from: rawEvents
Example — deduplicate by specific columns:
transformation:
- name: latestPerUser
distinct:
from: userActions
columns:
- user_id
- action_type
NULL values are considered equal for deduplication purposes.
Limit (8.10)
Returns at most N rows from a dataset.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
count | integer | Yes | Maximum rows to return. Must be >= 0. |
Setting count: 0 produces an empty dataset with the same schema.
Tip: Without a preceding
orderBy, the selected rows are non-deterministic. Combine withorderByfor predictable results.
Example — top 100 rows after sorting:
transformation:
- name: sorted
orderBy:
from: sales
columns:
- column: amount
direction: desc
- name: top100
limit:
from: sorted
count: 100
Sample (8.19)
Returns a random sample of rows from a dataset.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
from | AssetRef | Yes | — | Source asset. |
fraction | double | Yes | — | Sampling fraction in (0.0, 1.0]. |
withReplacement | boolean | No | false | Whether to sample with replacement. |
seed | integer | No | (random) | Random seed for reproducibility. |
Example — 10% sample for development:
transformation:
- name: devSample
sample:
from: production_data
fraction: 0.1
seed: 42
Example — sample with replacement:
transformation:
- name: bootstrapSample
sample:
from: training_data
fraction: 0.5
withReplacement: true
seed: 12345
Tip: Use the
seedparameter to get reproducible samples across pipeline runs. Without it, each run produces different results.