Skip to main content
Version: 2.0

Filtering and Row Selection

This section covers transformations that project columns, filter rows, deduplicate, limit, and sample datasets.


Select (8.1)

Projects specific columns or expressions from a dataset. The output contains only the listed columns, in the order specified.

Schema:

FieldTypeRequiredDescription
fromAssetRefYesSource asset.
columnsNonEmptyList[Expression]YesColumn names or expressions to include.

Each entry in columns can be a bare column name or an expression, optionally aliased with as.

Example — basic projection with computed columns:

transformation:
- name: projected
select:
from: employees
columns:
- id
- name
- "salary * 1.1 as adjusted_salary"
- "upper(department) as dept"

Tip: If a column name does not exist in the source, the runtime raises E-COL-001. Double-check your source schema.


Where / Filter (8.2)

Filters rows based on a boolean condition. Only rows where the condition evaluates to true are kept.

Schema:

FieldTypeRequiredDescription
fromAssetRefYesSource asset.
filterConditionYesBoolean expression for row selection.

Rows where filter evaluates to NULL are excluded — NULL is not truthy.

Example — filtering active users by date:

transformation:
- name: activeUsers
where:
from: users
filter: "status = 'active' AND created_at >= '2025-01-01'"

Example — numeric filter:

transformation:
- name: highValueOrders
where:
from: orders
filter: "amount > 500"

Distinct (8.9)

Removes duplicate rows from a dataset.

Schema:

FieldTypeRequiredDefaultDescription
fromAssetRefYesSource asset.
columnsList[Column]Noall columnsSubset of columns for deduplication.

When columns is omitted, all columns are used. When specified, one arbitrary row per group is kept — there is no guarantee which row is retained.

Tip: If you need deterministic deduplication, use orderBy followed by a window function with row_number() and then filter.

Example — full row deduplication:

transformation:
- name: uniqueEvents
distinct:
from: rawEvents

Example — deduplicate by specific columns:

transformation:
- name: latestPerUser
distinct:
from: userActions
columns:
- user_id
- action_type

NULL values are considered equal for deduplication purposes.


Limit (8.10)

Returns at most N rows from a dataset.

Schema:

FieldTypeRequiredDescription
fromAssetRefYesSource asset.
countintegerYesMaximum rows to return. Must be >= 0.

Setting count: 0 produces an empty dataset with the same schema.

Tip: Without a preceding orderBy, the selected rows are non-deterministic. Combine with orderBy for predictable results.

Example — top 100 rows after sorting:

transformation:
- name: sorted
orderBy:
from: sales
columns:
- column: amount
direction: desc

- name: top100
limit:
from: sorted
count: 100

Sample (8.19)

Returns a random sample of rows from a dataset.

Schema:

FieldTypeRequiredDefaultDescription
fromAssetRefYesSource asset.
fractiondoubleYesSampling fraction in (0.0, 1.0].
withReplacementbooleanNofalseWhether to sample with replacement.
seedintegerNo(random)Random seed for reproducibility.

Example — 10% sample for development:

transformation:
- name: devSample
sample:
from: production_data
fraction: 0.1
seed: 42

Example — sample with replacement:

transformation:
- name: bootstrapSample
sample:
from: training_data
fraction: 0.5
withReplacement: true
seed: 12345

Tip: Use the seed parameter to get reproducible samples across pipeline runs. Without it, each run produces different results.