Enumeration-based Cleaner

class EnumCleaner(cleaning_definitions={}, column_to_log_cleansed_values=None, store_as_map=False)[source]

Cleanses a DataFrame based on lists of allowed or disallowed values.

Examples

>>> from spooq.transformer import EnumCleaner
>>>
>>> transformer = EnumCleaner(
>>>     cleaning_definitions={
>>>         "status": {
>>>             "elements": ["active", "inactive"],
>>>         },
>>>         "version": {
>>>             "elements": ["", "None", "none", "null", "NULL"],
>>>             "mode": "disallow",
>>>             "default": None,
>>>         },
>>>     }
>>> )

>>> from spooq.transformer import EnumCleaner
>>> from pyspark.sql import Row
>>>
>>> input_df = spark.createDataFrame([
>>>     Row(a="stay", b="positive"),
>>>     Row(a="stay", b="negative"),
>>>     Row(a="stay", b="positive"),
>>> ])
>>> transformer = EnumCleaner(
>>>     cleaning_definitions={
>>>         "b": {
>>>             "elements": ["positive"],
>>>             "mode": "allow",
>>>         }
>>>     },
>>>     column_to_log_cleansed_values="cleansed_values_enum",
>>>     store_as_map=True,
>>> )
>>> output_df = transformer.transform(input_df)
>>> output_df.show()
+----+--------+--------------------+
|   a|       b|cleansed_values_enum|
+----+--------+--------------------+
|stay|positive|                  []|
|stay|    null|     [b -> negative]|
|stay|positive|                  []|
+----+--------+--------------------+
>>> output_df.printSchema()
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- cleansed_values_enum: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Parameters
  • cleaning_definitions (dict) – Dictionary containing column names and respective cleansing rules

  • column_to_log_cleansed_values (str, Defaults to None) – Defines a column in which the original (uncleansed) value will be stored in case of cleansing. If no column name is given, nothing will be logged.

  • store_as_map (bool, Defaults to False) – Specifies if the logged cleansed values should be stored in a column as pyspark.sql.types.MapType with stringified values or as pyspark.sql.types.StructType with the original respective data types.
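The difference between the two logging formats can be pictured with plain Python dictionaries. This is only a rough model under stated assumptions, not Spooq's implementation: the helper names `log_as_struct` and `log_as_map` and the integer column `code` are hypothetical, and dictionaries stand in for Spark's StructType and MapType values.

```python
def log_as_struct(originals, columns):
    """Model of the struct log: one field per configured column.

    Fields keep the original data type and are None where nothing was cleansed.
    """
    return {column: originals.get(column) for column in columns}


def log_as_map(originals):
    """Model of the map log: only cleansed columns appear, values are stringified."""
    return {column: str(value) for column, value in originals.items()}


# Hypothetical row: rules exist for "b" and "code", but only "code" was cleansed.
configured_columns = ["b", "code"]
cleansed_originals = {"code": 404}

struct_log = log_as_struct(cleansed_originals, configured_columns)
map_log = log_as_map(cleansed_originals)

print(struct_log)
print(map_log)
```

In this model the struct variant keeps the integer `404` and carries an empty field for `b`, while the map variant holds only the string `"404"`.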

Note

The following cleansing rule attributes are supported per column:

  • elements, mandatory - list

    A list of elements which will be used to allow or reject (based on mode) values from the input DataFrame.

  • mode, allow|disallow, defaults to ‘allow’ - str

    “allow” will set all values which are NOT in the list (ignoring NULL) to the default value. “disallow” will set all values which ARE in the list (ignoring NULL) to the default value.

  • default, defaults to None - Column or any primitive Python value

    If a value gets cleansed it gets replaced with the provided default value.
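The rule attributes above can be modelled in plain Python to make the allow and disallow semantics concrete. This is a minimal sketch, not Spooq's code (which works on Spark columns); the helper name `clean_value` is hypothetical and the error messages are paraphrased.

```python
def clean_value(value, elements, mode="allow", default=None):
    """Plain-Python model of a single cleansing rule (hypothetical helper)."""
    if not elements:
        raise ValueError("A non-empty list of elements is required per cleaning rule.")
    if mode not in ("allow", "disallow"):
        raise ValueError("Only the modes 'allow' and 'disallow' are supported.")
    if value is None:
        # NULL input values are ignored, i.e. passed through unchanged.
        return None
    in_list = value in elements
    # "allow" cleanses values outside the list, "disallow" cleanses values inside it.
    cleanse = (not in_list) if mode == "allow" else in_list
    return default if cleanse else value


statuses = [
    clean_value(v, ["active", "inactive"], default="unknown")
    for v in ["active", "deleted", None]
]
versions = [
    clean_value(v, ["", "None", "null"], mode="disallow")
    for v in ["1.2", "null", None]
]

print(statuses)
print(versions)
```

With a non-None default such as `"unknown"`, cleansed values can be told apart from NULL inputs that were passed through untouched.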

Returns

The transformed DataFrame

Return type

DataFrame

Raises
  • exceptions.ValueError – Enumeration-based cleaning requires a non-empty list of elements per cleaning rule! Spooq did not find such a list for column: {column_name}

  • exceptions.ValueError – Only the following modes are supported by EnumCleaner: ‘allow’ and ‘disallow’.

Warning

None values are explicitly ignored as input values because F.lit(None).isin(["elem1", "elem2"]) returns neither True nor False but None. If you want to replace null values, use the transformer spooq.transformer.NullCleaner.
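The three-valued logic behind this warning can be illustrated without Spark. The sketch below is a simplified plain-Python model that covers only the case of a NULL input value; the function names `sql_isin` and `cleanse_in_allow_mode` are hypothetical and not part of Spooq or PySpark.

```python
def sql_isin(value, elements):
    """Model of a SQL-style membership test: a NULL input yields NULL (None)."""
    if value is None:
        return None
    return value in elements


def cleanse_in_allow_mode(value, elements, default=None):
    # Cleanse only when the membership test is definitely False.
    # A None result (NULL input) is neither True nor False, so the value is kept.
    return default if sql_isin(value, elements) is False else value


membership = [sql_isin(v, ["elem1", "elem2"]) for v in ["elem1", "other", None]]
results = [
    cleanse_in_allow_mode(v, ["elem1", "elem2"], default="cleansed")
    for v in ["elem1", "other", None]
]

print(membership)
print(results)
```

Because the membership test is undecided for a NULL input, the value falls through unchanged, which is why a dedicated transformer is suggested for replacing null values.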

transform(input_df)[source]

Performs a transformation on a DataFrame.

Parameters

input_df (DataFrame) – Input DataFrame

Returns

Transformed DataFrame.

Return type

DataFrame

Note

This method takes only the input DataFrame as a parameter. All other required parameters are defined when the transformer object is initialized.

Base Class

This abstract class provides the functionality to log any cleansed values into a separate column that contains a struct with a sub-column per cleansed column (according to the cleaning_definitions). If a value was cleansed, the original value will be stored in its respective sub-column. If a value was not cleansed, the sub-column will be empty (None).

class BaseCleaner(cleaning_definitions, column_to_log_cleansed_values, store_as_map=False, temporary_columns_prefix='1b75cdd2e2356a35486230c69cfac5493488a919')[source]