Skip to main content
Version: Next

Reshaping Transformations

This section covers transformations that change the shape or structure of your data: pivoting, unpivoting, flattening nested structures, conditional column creation, and splitting datasets.


Pivot (8.16)

Rotates rows into columns (long to wide format). Distinct values from a chosen column become new column headers, with aggregated values filling the cells.

Schema:

FieldTypeRequiredDefaultDescription
fromAssetRefYesSource asset.
groupByNonEmptyList[Column]YesColumns to group by (these remain as rows).
pivotColumnColumnYesColumn whose distinct values become new columns.
valuesList[string]Noall distinctSpecific values to pivot on.
aggNonEmptyList[Expression]YesAggregate expressions applied per pivot value.

Example — sales by region and product:

transformation:
- name: salesPivot
pivot:
from: sales
groupBy: [region]
pivotColumn: product
values: ["A", "B", "C"]
agg:
- "sum(amount)"

Given input like:

regionproductamount
EastA100
EastB200
WestA150

The output becomes:

regionABC
East100200NULL
West150NULLNULL

Tip: Specify values explicitly when you know the expected pivot values. If omitted, all distinct values are used, but this requires an extra pass over the data and may produce unpredictable column ordering.


Unpivot (8.17)

Rotates columns into rows (wide to long format). The inverse of pivot.

Schema:

FieldTypeRequiredDescription
fromAssetRefYesSource asset.
idsNonEmptyList[Column]YesColumns to keep as identifiers.
valuesNonEmptyList[Column]YesColumns to unpivot into rows.
variableColumnstringYesName for the column holding original column names.
valueColumnstringYesName for the column holding the values.

Example — unpivot quarterly sales columns:

transformation:
- name: longFormat
unpivot:
from: quarterlySales
ids: [region, product]
values: [q1_sales, q2_sales, q3_sales, q4_sales]
variableColumn: quarter
valueColumn: sales_amount

Given input:

regionproductq1_salesq2_salesq3_salesq4_sales
EastWidget10012090150

The output becomes:

regionproductquartersales_amount
EastWidgetq1_sales100
EastWidgetq2_sales120
EastWidgetq3_sales90
EastWidgetq4_sales150

Flatten (8.18)

Flattens nested structures (structs and optionally arrays) into a flat schema.

Schema:

FieldTypeRequiredDefaultDescription
fromAssetRefYesSource asset.
separatorstringNo"_"Separator between parent and child field names.
explodeArraysbooleanNofalseIf true, array fields produce multiple rows.

Example — flatten nested JSON data:

transformation:
- name: flat
flatten:
from: nestedJson
separator: "_"
explodeArrays: false

Given a struct field address with subfields city and zip, the output columns become address_city and address_zip.

Semantics:

  • Flattening is recursive — nested structs within nested structs are fully flattened.
  • When explodeArrays is true, a row with an array of N elements produces N rows. Empty or NULL arrays cause the row to be dropped.

Example — flatten and explode arrays:

transformation:
- name: exploded
flatten:
from: ordersWithItems
separator: "_"
explodeArrays: true

Tip: If your data has deeply nested JSON or Parquet structs, flatten first and then use select or dropColumns to keep only the fields you need.


Conditional (8.20)

Adds a column with values determined by conditional logic, equivalent to SQL CASE WHEN.

Schema:

FieldTypeRequiredDefaultDescription
fromAssetRefYesSource asset.
outputColumnstringYesName for the new column.
branchesNonEmptyList[Branch]YesConditions evaluated in order.
otherwiseExpressionNoNULLDefault value if no branch matches.

Branch object:

FieldTypeRequiredDescription
conditionConditionYesBoolean condition.
valueExpressionYesValue if the condition is true.

Branches are evaluated in order. The first matching condition determines the value.

Example — categorize orders by size:

transformation:
- name: categorized
conditional:
from: orders
outputColumn: size
branches:
- condition: "amount > 1000"
value: "'large'"
- condition: "amount > 100"
value: "'medium'"
otherwise: "'small'"

Example — business logic with multiple conditions:

transformation:
- name: riskScored
conditional:
from: transactions
outputColumn: risk_level
branches:
- condition: "amount > 10000 AND country != 'US'"
value: "'high'"
- condition: "amount > 5000 OR previous_fraud = true"
value: "'medium'"
- condition: "account_age_days < 30"
value: "'review'"
otherwise: "'low'"

Tip: String literal values in value must be quoted within the expression (e.g., "'large'"). Without inner quotes, the runtime interprets it as a column reference.


Split (8.21)

Splits a dataset into two named assets based on a boolean condition. This is the only transformation that produces two output assets.

Schema:

FieldTypeRequiredDescription
fromAssetRefYesSource asset.
conditionConditionYesBoolean condition.
passAssetRefYesName for rows where condition is true.
failAssetRefYesName for rows where condition is false or NULL.

The name of the split transformation itself is not a referenceable asset. Only pass and fail are referenceable by downstream transformations. Both names must be unique in the document namespace.

Rows where condition evaluates to NULL go to fail.

Example — split API results by status:

transformation:
- name: splitByStatus
split:
from: apiResults
condition: "status_code = 200"
pass: successRecords
fail: failedRecords

# Use the pass output downstream
- name: successSummary
group:
from: successRecords
by: [endpoint]
agg: ["count(1) as calls"]

# Use the fail output downstream
- name: errorReport
select:
from: failedRecords
columns:
- endpoint
- status_code
- error_message

Tip: Split is useful for data quality workflows where you want to route valid and invalid records to separate processing paths.