Newest by Group (Most current record per ID)

class NewestByGroup(group_by=['id'], order_by=['updated_at', 'deleted_at'])

Groups the rows, orders them within each group, and selects the first row per group.

Example

>>> transformer = NewestByGroup(
...     group_by=["first_name", "last_name"],
...     order_by=["created_at_ms", "version"]
... )

Parameters
  • group_by (str or list of str, defaults to ['id']) – Attributes used as the grouping (partitioning) arguments of the window function.

  • order_by (str or list of str, defaults to ['updated_at', 'deleted_at']) – Attributes used as the ordering arguments of the window function. All columns are sorted in descending order.

Raises

exceptions.AttributeError – If any attribute in group_by or order_by is not contained in the input DataFrame.
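
The check described above could look roughly like the following sketch. It is not the library's implementation: the helper name check_columns and the error message are hypothetical, and a plain list of column names stands in for input_df.columns.

```python
def check_columns(available, group_by, order_by):
    """Raise AttributeError if a requested column is not available.

    `available` stands in for `input_df.columns`; the helper name and the
    error message are hypothetical.
    """
    # Accept a single column name as well as a list of names.
    group_by = [group_by] if isinstance(group_by, str) else list(group_by)
    order_by = [order_by] if isinstance(order_by, str) else list(order_by)

    missing = [name for name in group_by + order_by if name not in available]
    if missing:
        raise AttributeError(f"Columns not found in input DataFrame: {missing}")


# Passes silently because all requested columns are available.
check_columns(["id", "updated_at", "deleted_at"], "id", ["updated_at", "deleted_at"])
```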

Note

PySpark's Window function is used internally. The first row (row_number()) of each window is selected and returned.
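
To illustrate the selection logic without a Spark session, here is a minimal plain-Python sketch that emulates partitioning, descending ordering, and keeping row number 1. It works on a list of dicts instead of a DataFrame, the function name newest_by_group is hypothetical, and it assumes that the ordering columns contain no missing values.

```python
def newest_by_group(rows, group_by, order_by):
    """Keep the first row of each group after sorting descending by order_by."""
    # Partition the rows by the values of the grouping columns.
    partitions = {}
    for row in rows:
        key = tuple(row[column] for column in group_by)
        partitions.setdefault(key, []).append(row)

    result = []
    for partition in partitions.values():
        # Sort each partition in descending order of the ordering columns.
        ordered = sorted(
            partition,
            key=lambda row: tuple(row[column] for column in order_by),
            reverse=True,
        )
        # Number the rows from 1 (like row_number()) and keep only row 1.
        result.extend(
            row for row_number, row in enumerate(ordered, start=1) if row_number == 1
        )
    return result


rows = [
    {"id": 1, "updated_at": 10, "deleted_at": 0, "name": "a"},
    {"id": 1, "updated_at": 20, "deleted_at": 0, "name": "b"},
    {"id": 2, "updated_at": 5, "deleted_at": 0, "name": "c"},
]
newest = newest_by_group(rows, ["id"], ["updated_at", "deleted_at"])
print([row["name"] for row in newest])  # ['b', 'c']
```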

transform(input_df)

Performs a transformation on a DataFrame.

Parameters

input_df (DataFrame) – Input DataFrame

Returns

Transformed DataFrame.

Return type

DataFrame

Note

This method takes only the input DataFrame as a parameter. All other required parameters are defined when the transformer object is initialized.
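
One consequence of this convention is that differently configured transformers can be chained through the same call. The sketch below shows the pattern with plain Python objects. The class KeepColumns and the function run_pipeline are hypothetical and not part of the library, and lists of dicts stand in for DataFrames.

```python
class KeepColumns:
    """Hypothetical transformer: all settings are fixed in __init__."""

    def __init__(self, columns):
        self.columns = list(columns)

    def transform(self, rows):
        # Only the input data is passed here; configuration comes from self.
        return [{column: row[column] for column in self.columns} for row in rows]


def run_pipeline(rows, transformers):
    """Apply each transformer in turn via its single-argument transform()."""
    for transformer in transformers:
        rows = transformer.transform(rows)
    return rows


result = run_pipeline(
    [{"id": 1, "name": "a", "tmp": 0}],
    [KeepColumns(["id", "name"]), KeepColumns(["id"])],
)
print(result)  # [{'id': 1}]
```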