Newest by Group (Most current record per ID)
- class NewestByGroup(group_by=['id'], order_by=['updated_at', 'deleted_at'])
Groups the rows, orders each group in descending order, and selects the first row per group.
Example
>>> transformer = NewestByGroup(
>>>     group_by=["first_name", "last_name"],
>>>     order_by=["created_at_ms", "version"]
>>> )
- Parameters
group_by (str or list of str, defaults to ['id']) – List of attributes to be used within the Window Function as Grouping Arguments.
order_by (str or list of str, defaults to ['updated_at', 'deleted_at']) – List of attributes to be used within the Window Function as Ordering Arguments. All columns will be sorted in descending order.
- Raises
exceptions.AttributeError – If any attribute in group_by or order_by is not contained in the input DataFrame.
Note
PySpark's Window function is used internally. The first row (row_number()) per window will be selected and returned.
- transform(input_df)
Performs a transformation on a DataFrame.
- Parameters
input_df (DataFrame) – Input DataFrame
- Returns
Transformed DataFrame.
- Return type
DataFrame
Note
This method takes only the input DataFrame as a parameter. Any other required parameters are defined when the Transformer object is initialized.
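Illustration
The selection logic described for NewestByGroup (group, sort descending, keep the first row per group) can be pictured with a plain-Python analogue. This is only a sketch and not the library's implementation, which according to the Note above relies on PySpark's Window and row_number(). The function name newest_by_group, the use of dicts as rows, and the choice to sort None values last are all assumptions made for illustration.

```python
def newest_by_group(rows, group_by=("id",), order_by=("updated_at", "deleted_at")):
    """Hypothetical plain-Python analogue: first row per group after a descending sort."""

    def group_key(row):
        # Rows sharing the same values in the group_by columns form one group.
        return tuple(row[col] for col in group_by)

    def order_key(row):
        # (is_not_none, value) pairs make None sort last once reverse=True is applied.
        # This null placement is an assumption of the sketch.
        return tuple((row[col] is not None, row[col]) for col in order_by)

    newest = {}
    for row in sorted(rows, key=order_key, reverse=True):
        # The first row seen per group is the one that ranks highest in the sort.
        newest.setdefault(group_key(row), row)
    return list(newest.values())


rows = [
    {"id": 1, "updated_at": 10, "deleted_at": None, "name": "old"},
    {"id": 1, "updated_at": 20, "deleted_at": None, "name": "new"},
    {"id": 2, "updated_at": 5, "deleted_at": None, "name": "only"},
]
result = newest_by_group(rows)
print([row["name"] for row in result])  # ['new', 'only']
```

In this sketch a missing column surfaces as a KeyError, whereas the class documented above raises an AttributeError for attributes that are absent from the input DataFrame.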