Filtering and Row Selection
This section covers transformations that project columns, filter rows, deduplicate, limit, and sample datasets.
Select (8.1)
Projects specific columns or expressions from a dataset. The output contains only the listed columns, in the order specified.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
columns | NonEmptyList[Expression] | Yes | Column names or expressions to include. |
Each entry in columns can be a bare column name or an expression, optionally aliased with as.
Example — basic projection with computed columns:
transformation:
- name: projected
select:
from: employees
columns:
- id
- name
- "salary * 1.1 as adjusted_salary"
- "upper(department) as dept"
Tip: If a column name does not exist in the source, the runtime raises
E-COL-001. Double-check your source schema.
Where / Filter (8.2)
Filters rows based on a boolean condition. Only rows where the condition evaluates to true are kept.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
filter | Condition | Yes | Boolean expression for row selection. |
Rows where filter evaluates to NULL are excluded — NULL is not truthy.
Example — filtering active users by date:
transformation:
- name: activeUsers
where:
from: users
filter: "status = 'active' AND created_at >= '2025-01-01'"
Example — numeric filter:
transformation:
- name: highValueOrders
where:
from: orders
filter: "amount > 500"
Distinct (8.9)
Removes duplicate rows from a dataset.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
from | AssetRef | Yes | — | Source asset. |
columns | List[Column] | No | all columns | Subset of columns for deduplication. |
When columns is omitted, all columns are used. When specified, one arbitrary row per group is kept — there is no guarantee which row is retained.
Tip: If you need deterministic deduplication, use
orderByfollowed by awindowfunction withrow_number()and then filter.
Example — full row deduplication:
transformation:
- name: uniqueEvents
distinct:
from: rawEvents
Example — deduplicate by specific columns:
transformation:
- name: latestPerUser
distinct:
from: userActions
columns:
- user_id
- action_type
NULL values are considered equal for deduplication purposes.
Limit (8.10)
Returns at most N rows from a dataset.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
count | integer | Yes | Maximum rows to return. Must be >= 0. |
Setting count: 0 produces an empty dataset with the same schema.
Tip: Without a preceding
orderBy, the selected rows are non-deterministic. Combine withorderByfor predictable results.
Example — top 100 rows after sorting:
transformation:
- name: sorted
orderBy:
from: sales
columns:
- column: amount
direction: desc
- name: top100
limit:
from: sorted
count: 100
Sample (8.19)
Returns a random sample of rows from a dataset.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
from | AssetRef | Yes | — | Source asset. |
fraction | double | Yes | — | Sampling fraction in (0.0, 1.0]. |
withReplacement | boolean | No | false | Whether to sample with replacement. |
seed | integer | No | (random) | Random seed for reproducibility. |
Example — 10% sample for development:
transformation:
- name: devSample
sample:
from: production_data
fraction: 0.1
seed: 42
Example — sample with replacement:
transformation:
- name: bootstrapSample
sample:
from: training_data
fraction: 0.5
withReplacement: true
seed: 12345
Tip: Use the
seedparameter to get reproducible samples across pipeline runs. Without it, each run produces different results.
Offset (8.32)
New in v3.0.
Skips the first N rows from a dataset.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
count | integer | Yes | Rows to skip. Must be >= 0. |
Setting count: 0 is a no-op.
Tip: Without a preceding
orderBy, the skipped rows are non-deterministic. Combine withorderByfor predictable results.
Example -- skip the first 50 rows:
transformation:
- name: sorted
orderBy:
from: events
columns:
- column: event_time
direction: asc
- name: skipped
offset:
from: sorted
count: 50
Example -- pagination with offset and limit:
transformation:
- name: page3
offset:
from: sorted
count: 200
- name: page3Limited
limit:
from: page3
count: 100
Tail (8.33)
New in v3.0.
Returns the last N rows from a dataset.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
from | AssetRef | Yes | Source asset. |
count | integer | Yes | Maximum rows from the end. Must be >= 0. |
Setting count: 0 produces an empty dataset with the same schema.
Tip:
tailrequires materializing the entire input, so it may have performance implications on large datasets. PreferorderBy+limitwhen possible.
Example -- last 10 events:
transformation:
- name: sorted
orderBy:
from: events
columns:
- column: event_time
direction: asc
- name: latestEvents
tail:
from: sorted
count: 10
Drop NA (8.35)
New in v3.0.
Drops rows with NULL values.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
from | AssetRef | Yes | -- | Source asset. |
columns | List[Column] | No | all | Columns to check for NULLs. |
how | "any" or "all" | No | "any" | Drop if any/all target columns are NULL. |
minNonNulls | integer | No | -- | Keep row only if >= N non-NULL values. Overrides how. |
Example -- drop rows where any of the key columns are NULL:
transformation:
- name: complete
dropNa:
from: users
how: any
columns: [name, email]
Example -- keep rows with at least 3 non-NULL values:
transformation:
- name: sufficient
dropNa:
from: survey_responses
minNonNulls: 3
Tip: Use
how: "all"to only drop rows where every checked column is NULL. This is less aggressive than the default"any"mode.