Null Cleaner

class NullCleaner(cleaning_definitions=None, column_to_log_cleansed_values=None, store_as_map=False)[source]

Fills null values of the specified fields. Takes a dictionary that maps each field to be cleaned to the default value to set when that field is null.

Examples

>>> from spooq.transformer import NullCleaner
>>>
>>> # Minimal definition: replace null values in "points" with 0
>>> transformer = NullCleaner(
>>>     cleaning_definitions={
>>>         "points": {
>>>             "default": 0
>>>         }
>>>     }
>>> )

>>> from spooq.transformer import NullCleaner
>>> from pyspark.sql import Row
>>>
>>> input_df = spark.createDataFrame([
>>>     Row(id=0, points=5),
>>>     Row(id=1, points=None),
>>>     Row(id=2, points=15),
>>> ])
>>> transformer = NullCleaner(
>>>     cleaning_definitions={
>>>         "points": {"default": 0},
>>>     },
>>>     column_to_log_cleansed_values="cleansed_values_null",
>>>     store_as_map=True,
>>> )
>>> output_df = transformer.transform(input_df)
>>> output_df.show()
+---+------+--------------------+
| id|points|cleansed_values_null|
+---+------+--------------------+
|  0|     5|                null|
|  1|     0|    [points -> null]|
|  2|    15|                null|
+---+------+--------------------+
>>> output_df.printSchema()
 |-- id: long (nullable = true)
 |-- points: long (nullable = true)
 |-- cleansed_values_null: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

Parameters
  • cleaning_definitions (dict) – Dictionary containing column names and respective default values

  • column_to_log_cleansed_values (str, Defaults to None) – Defines a column in which the original (uncleansed) value will be stored in case of cleansing. If no column name is given, nothing will be logged.

  • store_as_map (bool, Defaults to False) – Specifies if the logged cleansed values should be stored in a column as pyspark.sql.types.MapType or as pyspark.sql.types.StructType with stringified values.
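The interplay of these parameters can be illustrated without Spark. The following is a minimal plain-Python sketch, not the library's implementation: rows are modelled as dictionaries, and the helper name clean_nulls is hypothetical. It assumes the map-style log shown in the example output above, where the log column is null for rows in which nothing was cleansed.

```python
from typing import Any, Dict, List, Optional


def clean_nulls(
    rows: List[Dict[str, Any]],
    cleaning_definitions: Dict[str, Dict[str, Any]],
    column_to_log_cleansed_values: Optional[str] = None,
) -> List[Dict[str, Any]]:
    """Hypothetical plain-Python model of null cleaning with a map-style log."""
    cleaned_rows = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay untouched
        cleansed = {}  # column name -> original (null) value
        for column, definition in cleaning_definitions.items():
            if new_row.get(column) is None:
                cleansed[column] = new_row.get(column)
                new_row[column] = definition["default"]
        if column_to_log_cleansed_values is not None:
            # Assumption: the log is null when no column was cleansed,
            # mirroring the example output in this section.
            new_row[column_to_log_cleansed_values] = cleansed or None
        cleaned_rows.append(new_row)
    return cleaned_rows


rows = [
    {"id": 0, "points": 5},
    {"id": 1, "points": None},
    {"id": 2, "points": 15},
]
result = clean_nulls(rows, {"points": {"default": 0}}, "cleansed_values_null")
for row in result:
    print(row)
```

If no log column name is passed, the sketch only fills the defaults and adds no extra key, which corresponds to "nothing will be logged".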

Note

The following cleaning_definitions attributes per column are mandatory:

  • default - Column or any primitive Python value

    If a value gets cleansed it gets replaced with the provided default value.

Returns

The transformed DataFrame

Return type

DataFrame

Raises

exceptions.ValueError – Null-based cleaning requires the field default. Default parameter is not specified for column with name: {column_name}
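As an illustration of this error condition, the sketch below checks a cleaning_definitions dictionary for missing default entries. The helper check_defaults is hypothetical and only mimics the documented message; at which point the library itself performs this check is not stated here.

```python
from typing import Any, Dict


def check_defaults(cleaning_definitions: Dict[str, Dict[str, Any]]) -> None:
    """Hypothetical helper: reject definitions that lack a "default" key."""
    for column_name, definition in cleaning_definitions.items():
        if "default" not in definition:
            raise ValueError(
                "Null-based cleaning requires the field default. "
                "Default parameter is not specified for column with name: "
                f"{column_name}"
            )


# A complete definition passes silently.
check_defaults({"points": {"default": 0}})

# A definition without "default" triggers the error.
message = None
try:
    check_defaults({"points": {}})
except ValueError as error:
    message = str(error)
    print(message)
```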

transform(input_df)[source]

Performs a transformation on a DataFrame.

Parameters

input_df (DataFrame) – Input DataFrame

Returns

Transformed DataFrame.

Return type

DataFrame

Note

This method takes only the input DataFrame as a parameter. All other required parameters are defined when the transformer object is initialized.

Base Class

This abstract class provides the functionality to log any cleansed values into a separate column that contains a struct with one sub column per cleansed column (according to the cleaning_definitions). If a value was cleansed, the original value is stored in its respective sub column. If a value was not cleansed, the sub column is empty (None).
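The struct-style log described above can be modelled in plain Python. This is only a sketch under assumptions: the function struct_log and the predicate needs_cleansing are hypothetical names, and the rule "negative values are invalid" is made up to show a case where the original value is not null.

```python
from typing import Any, Callable, Dict


def struct_log(
    row: Dict[str, Any],
    cleaning_definitions: Dict[str, Dict[str, Any]],
    needs_cleansing: Callable[[Any], bool],
) -> Dict[str, Any]:
    """Build one struct-like log entry with a sub-field per defined column."""
    log = {}
    for column in cleaning_definitions:
        value = row.get(column)
        # Original value if it was cleansed, otherwise None.
        log[column] = value if needs_cleansing(value) else None
    return log


definitions = {"points": {"default": 0}, "level": {"default": 1}}
# Hypothetical rule for illustration: negative numbers get cleansed.
entry = struct_log(
    {"id": 1, "points": -3, "level": 7},
    definitions,
    needs_cleansing=lambda value: value is not None and value < 0,
)
print(entry)
```

Every column from the definitions appears in the entry, so the log keeps the same shape for all rows regardless of which values were cleansed.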

class BaseCleaner(cleaning_definitions, column_to_log_cleansed_values, store_as_map=False, temporary_columns_prefix='1b75cdd2e2356a35486230c69cfac5493488a919')[source]