Column Operations
This section covers transformations that add, remove, rename, or change the type of columns in a dataset.
Add Columns (8.11)
Adds one or more computed columns to a dataset.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
| `from` | AssetRef | Yes | Source asset. |
| `columns` | NonEmptyList[ColumnDef] | Yes | Columns to add. |

ColumnDef object:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Name of the new column. |
| `expression` | Expression | Yes | Expression to compute the column value. |

Example — derived columns:
```yaml
transformation:
  - name: withDerived
    addColumns:
      from: orders
      columns:
        - name: total
          expression: "quantity * unit_price"
        - name: processed_at
          expression: "current_timestamp()"
        - name: category
          expression: "case when amount > 1000 then 'high' else 'low' end"
```
Key behaviors:
- If a column with the same name already exists, it is replaced (overwritten).
- Expressions can reference existing columns and columns added earlier in the same `addColumns` list (entries are evaluated in order).

Tip: Use `addColumns` to build derived fields step by step. Since columns are evaluated in order, later columns can reference earlier ones defined in the same list.
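
As a minimal sketch of in-order evaluation, the list below builds `total` from two entries defined earlier in the same list. The asset `orders` and the columns `quantity` and `unit_price` are taken from the example above; the names `withTotals`, `subtotal`, and `tax`, and the 8% rate, are hypothetical and chosen only for illustration.

```yaml
transformation:
  - name: withTotals                      # hypothetical transformation name
    addColumns:
      from: orders
      columns:
        - name: subtotal
          expression: "quantity * unit_price"
        # References subtotal, defined one entry earlier in this list.
        - name: tax
          expression: "subtotal * 0.08"   # illustrative rate, not from the spec
        # References both earlier entries.
        - name: total
          expression: "subtotal + tax"
```

Because entries are evaluated top to bottom, moving `total` above `subtotal` would presumably leave its references unresolved.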
Drop Columns (8.12)
Removes columns from a dataset.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
| `from` | AssetRef | Yes | Source asset. |
| `columns` | NonEmptyList[Column] | Yes | Columns to remove. |

Example — remove sensitive fields before output:
```yaml
transformation:
  - name: sanitized
    dropColumns:
      from: customers
      columns:
        - ssn
        - credit_card_number
        - date_of_birth
```
Constraints:
- If a named column does not exist, the runtime raises `E-COL-001`.
- Dropping all columns is an error (`E-SCHEMA-002`); at least one column must remain.
Rename Columns (8.13)
Renames columns using a mapping of old names to new names.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
| `from` | AssetRef | Yes | Source asset. |
| `mappings` | Map[Column, string] | Yes | Old name to new name mapping. At least one entry. |

Example — standardize column names:
```yaml
transformation:
  - name: renamed
    renameColumns:
      from: rawData
      mappings:
        first_name: firstName
        last_name: lastName
        e_mail: email
        phone_number: phone
```
Constraints:
- If an old name does not exist, the runtime raises `E-COL-001`.
- If a new name collides with an existing (unrenamed) column, the runtime raises `E-NAME-003`.

Tip: Rename is useful after joins to resolve qualified column names like `employees.name` into cleaner names.
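
The following sketch applies a rename after a join. It assumes a joined asset named `employeesWithDept` whose columns carry qualified names such as `employees.name` and `departments.name`; the asset and column names are hypothetical. Whether keys containing dots must be quoted depends on the implementation, so they are quoted here as a precaution.

```yaml
transformation:
  - name: flattened
    renameColumns:
      from: employeesWithDept             # hypothetical joined asset
      mappings:
        # Qualified names from the join mapped to unambiguous flat names.
        "employees.name": employee_name
        "departments.name": department_name
```

Each new name has to differ from every column that is not being renamed, otherwise the `E-NAME-003` constraint above applies.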
Cast Columns (8.14)
Changes the data type of one or more columns.
Schema:
| Field | Type | Required | Description |
|---|---|---|---|
| `from` | AssetRef | Yes | Source asset. |
| `columns` | NonEmptyList[CastDef] | Yes | Cast definitions. |

CastDef object:
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | Column | Yes | Column to cast. |
| `targetType` | string | Yes | Target data type (see Data Types section). |

Example — cast string columns to proper types:
```yaml
transformation:
  - name: typed
    castColumns:
      from: csvImport
      columns:
        - name: age
          targetType: integer
        - name: salary
          targetType: double
        - name: hire_date
          targetType: date
        - name: is_active
          targetType: boolean
```
Key behaviors:
- If a value cannot be cast (e.g., `"abc"` to `integer`), the value becomes `NULL`. The pipeline does not fail for individual cast failures.
- Type names are case-insensitive: `Integer`, `INTEGER`, and `integer` are all equivalent.

Tip: To enforce strict casting where invalid values should fail the pipeline, follow `castColumns` with an `assertion` transformation that checks for unexpected NULLs.
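
Because failed casts become NULL silently, it can help to keep the original string next to the cast result so that failures stay visible. The sketch below uses only `addColumns` and `castColumns`. It rests on several assumptions not confirmed in this section: that a transformation can name an earlier one in `from`, that the expression language accepts SQL-style `is null` / `is not null` tests, and that `true` and `false` are valid expression literals. The names `withRaw`, `flagged`, `age_raw`, and `age_cast_failed` are hypothetical.

```yaml
transformation:
  # Step 1: copy the raw string before it is cast.
  - name: withRaw
    addColumns:
      from: csvImport
      columns:
        - name: age_raw
          expression: "age"
  # Step 2: cast; values that cannot be parsed become NULL.
  - name: typed
    castColumns:
      from: withRaw                       # assumes chaining by transformation name
      columns:
        - name: age
          targetType: integer
  # Step 3: flag rows where the cast turned a non-NULL input into NULL.
  - name: flagged
    addColumns:
      from: typed
      columns:
        - name: age_cast_failed
          expression: "case when age is null and age_raw is not null then true else false end"
```

An `assertion` step, as the tip suggests, could then check that `age_cast_failed` is never true; its schema is defined outside this section.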
Fill NA (8.34)
New in v3.0.
Replaces NULL values in specified columns with a fill value.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `from` | AssetRef | Yes | -- | Source asset. |
| `columns` | List[Column] | No | all | Columns to fill. |
| `value` | primitive | Conditional | -- | Single fill value for all target columns. |
| `values` | Map[Column, primitive] | Conditional | -- | Per-column fill values. |

Exactly one of `value` or `values` must be provided. Providing both or neither raises `E-OP-004`.
Example -- fill all NULLs with a single value:
```yaml
transformation:
  - name: filled
    fillNa:
      from: metrics
      value: 0
```
Example -- per-column fill values:
```yaml
transformation:
  - name: filled
    fillNa:
      from: users
      values:
        name: "unknown"
        age: 0
        active: false
```
Tip: Use `fillNa` as a cleaner alternative to multiple `addColumns` with `coalesce` expressions when you need to replace NULLs with defaults.
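
To illustrate the comparison, the two transformations below are intended to produce the same result for a hypothetical `discount` column on an `orders` asset. The first restricts `fillNa` to one column through `columns`. The second is the `addColumns` form, which relies on a same-named column being overwritten and assumes that `coalesce` follows its usual SQL semantics in the expression language.

```yaml
transformation:
  # fillNa form: only the listed column is filled.
  - name: filledDiscount
    fillNa:
      from: orders
      columns: [discount]                 # hypothetical column
      value: 0
  # addColumns form: overwrites discount with a NULL-safe version of itself.
  - name: filledDiscountAlt
    addColumns:
      from: orders
      columns:
        - name: discount
          expression: "coalesce(discount, 0)"
```

With several columns to default, the `fillNa` form stays a single `values` map, while the `addColumns` form needs one entry per column.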
Replace (8.36)
New in v3.0.
Replaces specific values in specified columns with new values.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `from` | AssetRef | Yes | -- | Source asset. |
| `columns` | List[Column] | No | all | Columns to apply replacements. |
| `mappings` | NonEmptyList[Replacement] | Yes | -- | Old to new value pairs. |

Replacement object:
| Field | Type | Required | Description |
|---|---|---|---|
| `old` | primitive | Yes | Value to match. |
| `new` | primitive | Yes | Replacement value. |

Example -- normalize status values:
```yaml
transformation:
  - name: cleaned
    replace:
      from: orders
      columns: [status]
      mappings:
        - old: "N/A"
          new: "unknown"
        - old: ""
          new: "unknown"
        - old: "cancelled"
          new: "canceled"
```
Example -- replace values across all columns:
```yaml
transformation:
  - name: sanitized
    replace:
      from: survey
      mappings:
        - old: -999
          new: 0
```
Tip: `replace` is useful for data cleaning tasks where specific sentinel values need to be normalized. For NULL replacement, use `fillNa` instead.
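
Since `replace` matches concrete values and `fillNa` handles NULLs, cleaning a column often takes both. The sketch below chains the two on the `status` column of the `orders` asset used earlier. It assumes that `from` may reference the preceding transformation by name; the transformation names `normalizedStatus` and `completeStatus` are hypothetical.

```yaml
transformation:
  # Step 1: map sentinel strings to a single label.
  - name: normalizedStatus
    replace:
      from: orders
      columns: [status]
      mappings:
        - old: "N/A"
          new: "unknown"
        - old: ""
          new: "unknown"
  # Step 2: give remaining NULLs the same label.
  - name: completeStatus
    fillNa:
      from: normalizedStatus              # assumes chaining by transformation name
      columns: [status]
      value: "unknown"
```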
Parse (8.38)
New in v3.0.
Parses string columns containing structured data (CSV or JSON) into typed columns.
Schema:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `from` | AssetRef | Yes | -- | Source asset. |
| `column` | Column | Yes | -- | String column to parse. |
| `format` | "json" or "csv" | Yes | -- | Parse format. |
| `schema` | List[SchemaColumn] | No | (inferred) | Expected schema for parsed data. |
| `options` | Map[string, string] | No | {} | Format-specific options. |

The `schema` entries use the same format as `schemaEnforce` columns (with `name` and `dataType` fields).
Example -- parse JSON payload into typed columns:
```yaml
transformation:
  - name: parsedEvents
    parse:
      from: rawLogs
      column: payload
      format: json
      schema:
        - name: event_type
          dataType: string
        - name: timestamp
          dataType: timestamp
        - name: metadata
          dataType: "map<string, string>"
```
Example -- parse embedded CSV data:
```yaml
transformation:
  - name: parsedCSV
    parse:
      from: rawData
      column: csv_field
      format: csv
      options:
        delimiter: "|"
        header: "true"
```
Key behaviors:
- The parsed column is replaced by the structured result columns.
- Malformed values produce NULL for the affected fields.
- If `schema` is omitted, the implementation attempts schema inference.

Tip: Always provide `schema` in production pipelines for predictable results. Schema inference can be unreliable with inconsistent data.
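
Following that advice, the embedded-CSV example can be given an explicit schema. This is a sketch under assumptions: the column names `sku`, `quantity`, and `price` are hypothetical, and combining `schema` with `options` for `csv` is implied by the field table rather than stated outright. How declared names interact with `header: "true"` is not specified in this section.

```yaml
transformation:
  - name: parsedCSV
    parse:
      from: rawData
      column: csv_field
      format: csv
      # Explicit schema instead of inference; column names are illustrative.
      schema:
        - name: sku
          dataType: string
        - name: quantity
          dataType: integer
        - name: price
          dataType: double
      options:
        delimiter: "|"
        header: "true"
```

Per the key behaviors above, a malformed value would be expected to yield NULL for the affected field rather than fail the pipeline.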