Null Cleaner
class NullCleaner(cleaning_definitions=None, column_to_log_cleansed_values=None, store_as_map=False)
Bases: spooq.transformer.base_cleaner.BaseCleaner
Fills null values of the specified fields. Takes a dictionary that maps each field to be cleaned to the default value to set when the field is null.
Examples
>>> from pyspark.sql import functions as F
>>> from spooq.transformer import NullCleaner
>>> transformer = NullCleaner(
>>>     cleaning_definitions={
>>>         "points": {
>>>             "default": 0
>>>         }
>>>     }
>>> )
>>> from spooq.transformer import NullCleaner
>>> from pyspark.sql import Row
>>>
>>> input_df = spark.createDataFrame([
>>>     Row(id=0, points=5),
>>>     Row(id=1, points=None),
>>>     Row(id=2, points=15),
>>> ])
>>> transformer = NullCleaner(
>>>     cleaning_definitions={
>>>         "points": {"default": 0},
>>>     },
>>>     column_to_log_cleansed_values="cleansed_values_null",
>>>     store_as_map=True,
>>> )
>>> output_df = transformer.transform(input_df)
>>> output_df.show()
+---+------+--------------------+
| id|points|cleansed_values_null|
+---+------+--------------------+
|  0|     5|                null|
|  1|     0|    [points -> null]|
|  2|    15|                null|
+---+------+--------------------+
>>> output_df.printSchema()
 |-- id: long (nullable = true)
 |-- points: long (nullable = true)
 |-- cleansed_values_null: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
Parameters:
- cleaning_definitions (dict) – Dictionary containing column names and respective default values
- column_to_log_cleansed_values (str, Defaults to None) – Defines a column in which the original (uncleansed) value will be stored in case of cleansing. If no column name is given, nothing will be logged.
- store_as_map (bool, Defaults to False) – Specifies if the logged cleansed values should be stored in a column as pyspark.sql.types.MapType or as pyspark.sql.types.StructType with stringified values.
Note
The following cleaning_definitions attributes per column are mandatory:
- default - Column or any primitive Python value. If a value gets cleansed, it is replaced with the provided default value.
Returns: The transformed DataFrame
Return type: DataFrame
Raises: exceptions.ValueError – Null-based cleaning requires the field default. Default parameter is not specified for column with name: {column_name}
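The mandatory default attribute and the resulting ValueError can be illustrated with a minimal plain-Python sketch. The helper name validate_cleaning_definitions is hypothetical and is not part of Spooq. The sketch only mirrors the documented error message and makes no claim about where or when the library actually performs this check.

```python
def validate_cleaning_definitions(cleaning_definitions):
    """Hypothetical helper: raise ValueError if a column has no 'default' entry."""
    for column_name, definition in cleaning_definitions.items():
        if "default" not in definition:
            raise ValueError(
                "Null-based cleaning requires the field default. "
                f"Default parameter is not specified for column with name: {column_name}"
            )


# A complete definition passes silently.
validate_cleaning_definitions({"points": {"default": 0}})

# A definition without "default" is rejected.
try:
    validate_cleaning_definitions({"points": {}})
except ValueError as error:
    message = str(error)
```

Here message ends with the name of the offending column, which makes an incomplete definition easy to locate in a larger dictionary.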
transform(input_df)
Performs a transformation on a DataFrame.
Parameters: input_df (DataFrame) – Input DataFrame
Returns: Transformed DataFrame.
Return type: DataFrame
Note
This method takes only the input DataFrame as a parameter. All other parameters are defined when the transformer object is initialized.
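The configure-then-transform pattern described in the note can be sketched without Spark. PlainNullFiller below is a hypothetical plain-Python stand-in that works on a list of dictionaries instead of a DataFrame. It is not the Spooq implementation and only shows that all configuration lives in the constructor while transform receives nothing but the input data.

```python
class PlainNullFiller:
    """Hypothetical stand-in: configured once, then applied to any input."""

    def __init__(self, cleaning_definitions):
        # All configuration is captured at initialization time.
        self.cleaning_definitions = cleaning_definitions

    def transform(self, rows):
        # Only the input data is passed in; the input rows are not mutated.
        cleaned_rows = []
        for row in rows:
            cleaned = dict(row)
            for column_name, definition in self.cleaning_definitions.items():
                if column_name in cleaned and cleaned[column_name] is None:
                    cleaned[column_name] = definition["default"]
            cleaned_rows.append(cleaned)
        return cleaned_rows


rows = [
    {"id": 0, "points": 5},
    {"id": 1, "points": None},
    {"id": 2, "points": 15},
]
filler = PlainNullFiller({"points": {"default": 0}})
cleaned_rows = filler.transform(rows)
```

Because the configuration is fixed at construction, the same filler object can be reused on further inputs with the same columns.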
Base Class
This abstract class provides the functionality to log any cleansed values into a separate column that contains a struct with a sub column per cleansed column (according to the cleaning_definition). If a value was cleansed, the original value will be stored in its respective sub column. If a value was not cleansed, the sub column will be empty (None).
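The logging behaviour described above can be modelled in plain Python. The function clean_and_log, the needs_cleansing predicate, and the sample rule (negative values count as invalid) are all hypothetical and chosen only for illustration. The sketch uses dictionaries in place of Spark structs and does not reproduce how Spooq builds the column.

```python
def clean_and_log(rows, cleaning_definitions, needs_cleansing, log_column):
    """Hypothetical model: clean values and keep the originals in a log column."""
    output = []
    for row in rows:
        cleaned = dict(row)
        log = {}
        for column_name, definition in cleaning_definitions.items():
            original = row.get(column_name)
            if needs_cleansing(original):
                # Cleansed: replace the value and remember the original.
                cleaned[column_name] = definition["default"]
                log[column_name] = original
            else:
                # Not cleansed: the sub column stays empty (None).
                log[column_name] = None
        cleaned[log_column] = log
        output.append(cleaned)
    return output


rows = [{"id": 0, "points": 5}, {"id": 1, "points": -3}]
logged_rows = clean_and_log(
    rows,
    cleaning_definitions={"points": {"default": 0}},
    needs_cleansing=lambda value: value is not None and value < 0,  # assumed rule
    log_column="cleansed_values",
)
```

Each output row carries one log entry per column named in cleaning_definitions: the original value where cleansing happened, and None otherwise.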
class BaseCleaner(cleaning_definitions, column_to_log_cleansed_values, store_as_map=False, temporary_columns_prefix='1b75cdd2e2356a35486230c69cfac5493488a919')
Bases: spooq.transformer.transformer.Transformer