Enumeration-based Cleaner

class EnumCleaner(cleaning_definitions={})

Bases: spooq2.transformer.transformer.Transformer

Cleanses a DataFrame based on lists of allowed or disallowed values.

Example

>>> transformer = EnumCleaner(
>>>     cleaning_definitions={
>>>         "status": {
>>>             "elements": ["active", "inactive"],
>>>         },
>>>         "version": {
>>>             "elements": ["", "None", "none", "null", "NULL"],
>>>             "mode": "disallow",
>>>             "default": None
>>>         },
>>>     }
>>> )
Parameters: cleaning_definitions (dict) – Dictionary containing column names and their respective cleansing rules

Note

The following cleansing rule attributes are supported per column:

  • elements, mandatory - list
    A list of elements which will be used to allow or reject (based on mode) values from the input DataFrame.
  • mode, allow|disallow, defaults to ‘allow’ - str
    “allow” will set all values which are NOT in the list (ignoring NULL) to the default value. “disallow” will set all values which ARE in the list (ignoring NULL) to the default value.
  • default, defaults to None - Column or any primitive Python value
    If a value gets cleansed it gets replaced with the provided default value.
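
The rule attributes above can be modelled in plain Python for a single value. This is only a sketch of the documented semantics, not Spooq's implementation (which operates on Spark columns); the helper name clean_value is hypothetical.

```python
def clean_value(value, elements, mode="allow", default=None):
    """Plain-Python model of one cleansing rule (hypothetical helper, not part of Spooq)."""
    if value is None:
        # NULL inputs are ignored and pass through unchanged.
        return value
    if mode == "allow":
        # Keep only values that appear in the list.
        return value if value in elements else default
    if mode == "disallow":
        # Replace values that appear in the list.
        return default if value in elements else value
    raise ValueError("Only the following modes are supported: 'allow' and 'disallow'.")


# "allow" mode: anything outside the list is replaced, NULL stays NULL.
statuses = ["active", "deleted", None]
cleaned_statuses = [
    clean_value(v, ["active", "inactive"], default="unknown") for v in statuses
]

# "disallow" mode: listed placeholder strings are replaced with the default (None).
versions = ["1.2", "null", ""]
cleaned_versions = [
    clean_value(v, ["", "None", "none", "null", "NULL"], mode="disallow")
    for v in versions
]
```

Note that the NULL entry in statuses is not replaced with "unknown", which matches the "ignoring NULL" wording of the mode attribute.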
Returns:

The transformed DataFrame

Return type:

pyspark.sql.DataFrame

Raises:
  • exceptions.ValueError – Enumeration-based cleaning requires a non-empty list of elements per cleaning rule! Spooq did not find such a list for column: {column_name}
  • exceptions.ValueError – Only the following modes are supported by EnumCleaner: ‘allow’ and ‘disallow’.
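
The two error conditions above can be illustrated with a small validation sketch. The function name validate_definitions is hypothetical, and the point at which Spooq actually performs these checks (initialization or transformation) is not stated here, so treat this as an assumption-laden illustration of the rules rather than the library's code.

```python
def validate_definitions(cleaning_definitions):
    """Check each rule for a non-empty element list and a supported mode (hypothetical helper)."""
    for column_name, rule in cleaning_definitions.items():
        elements = rule.get("elements")
        if not isinstance(elements, list) or not elements:
            raise ValueError(
                "Enumeration-based cleaning requires a non-empty list of elements "
                "per cleaning rule! Spooq did not find such a list for column: "
                "{column_name}".format(column_name=column_name)
            )
        # "mode" is optional and defaults to "allow".
        if rule.get("mode", "allow") not in ("allow", "disallow"):
            raise ValueError(
                "Only the following modes are supported by EnumCleaner: "
                "'allow' and 'disallow'."
            )


# A valid definition passes silently.
validate_definitions({"status": {"elements": ["active", "inactive"]}})
```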

Warning

None values are explicitly ignored as input values because F.lit(None).isin(["elem1", "elem2"]) returns neither True nor False but NULL. If you want to replace NULL values, use Spark's pyspark.sql.DataFrame.fillna method instead.
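
The behaviour behind this warning follows from SQL's three-valued logic. The sketch below models it in plain Python, with None standing in for NULL; sql_isin and sql_not are hypothetical helpers, and the model assumes the element list itself contains no NULLs.

```python
def sql_isin(value, elements):
    """Simplified model of a SQL IN check: a NULL input yields NULL, not False."""
    if value is None:
        return None
    return value in elements


def sql_not(condition):
    """Model of SQL NOT: negating NULL is still NULL."""
    if condition is None:
        return None
    return not condition


elements = ["elem1", "elem2"]

in_list = sql_isin(None, elements)       # the "disallow" test for a NULL input
not_in_list = sql_not(in_list)           # the "allow" test for a NULL input

# Neither test is True for a NULL input, so neither mode can match it.
null_is_matched = in_list is True or not_in_list is True
```

Because neither condition evaluates to True, NULL replacement has to happen in a separate step, for example with input_df.fillna({"status": "unknown"}) before or after the transformer runs.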

transform(input_df)

Performs a transformation on a DataFrame.

Parameters: input_df (pyspark.sql.DataFrame) – Input DataFrame
Returns: Transformed DataFrame.
Return type: pyspark.sql.DataFrame

Note

This method takes only the input DataFrame as a parameter. All other required parameters are defined when the Transformer object is initialized.