Newest by Group (Most current record per ID)

class NewestByGroup(group_by=['id'], order_by=['updated_at', 'deleted_at'])[source]

Bases: spooq2.transformer.transformer.Transformer

Groups the rows, orders each group, and selects the first row per group.

Example

>>> transformer = NewestByGroup(
...     group_by=["first_name", "last_name"],
...     order_by=["created_at_ms", "version"]
... )
Parameters:
  • group_by (str or list of str, defaults to ['id']) – Attribute or list of attributes used as the grouping (partitioning) arguments of the window function.
  • order_by (str or list of str, defaults to ['updated_at', 'deleted_at']) – Attribute or list of attributes used as the ordering arguments of the window function. All columns are sorted in descending order.
Raises:

exceptions.AttributeError – If any attribute in group_by or order_by is not contained in the input DataFrame.

Note

PySpark’s Window function is used internally. The rows of each window are numbered with row_number(), and only the first row per window is selected and returned.
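
The selection logic described in this note can be illustrated without Spark. The following is a minimal plain-Python sketch of the semantics only, not the transformer's actual implementation: the function name newest_by_group and the sample rows are hypothetical, rows are modelled as dictionaries, and None values are assumed to sort last under descending order (Spark's default for descending sorts). Ties within a group are resolved by keeping the first row encountered, whereas row_number() gives no such guarantee.

```python
def newest_by_group(rows, group_by=("id",), order_by=("updated_at", "deleted_at")):
    """Keep, per group, the row that sorts first when order_by is descending.

    Hypothetical helper illustrating the semantics; it is not part of spooq2.
    """
    def order_key(row):
        # (is_present, value) pairs make None compare lower than any real
        # value, so None ends up last in descending order.
        return tuple(
            (row[col] is not None, row[col] if row[col] is not None else 0)
            for col in order_by
        )

    newest = {}
    for row in rows:
        group = tuple(row[col] for col in group_by)
        # Replace the stored row only if the current one sorts strictly higher.
        if group not in newest or order_key(row) > order_key(newest[group]):
            newest[group] = row
    return list(newest.values())


# Hypothetical sample data: three versions of id 1 and one version of id 2.
rows = [
    {"id": 1, "updated_at": 100, "deleted_at": None, "name": "old"},
    {"id": 1, "updated_at": 200, "deleted_at": None, "name": "new"},
    {"id": 2, "updated_at": 150, "deleted_at": None, "name": "only"},
    {"id": 1, "updated_at": 200, "deleted_at": 250, "name": "deleted"},
]
result = newest_by_group(rows)
```

In this sketch, the row with the highest updated_at wins for each id, and deleted_at breaks the tie between the two rows of id 1 that share updated_at = 200.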

transform(input_df)[source]

Performs a transformation on a DataFrame.

Parameters: input_df (pyspark.sql.DataFrame) – Input DataFrame
Returns: Transformed DataFrame.
Return type: pyspark.sql.DataFrame

Note

This method takes only the input DataFrame as a parameter. All other required parameters are defined when the transformer object is initialized.