Null Cleaner
- class NullCleaner(cleaning_definitions=None, column_to_log_cleansed_values=None, store_as_map=False)[source]
Fills null values of the specified fields. Takes a dictionary with the fields to be cleaned and the default value to be set when the field is null.
Examples
>>> from pyspark.sql import functions as F
>>> from spooq.transformer import NullCleaner
>>> transformer = NullCleaner(
>>>     cleaning_definitions={
>>>         "points": {
>>>             "default": 0
>>>         }
>>>     }
>>> )
>>> from spooq.transformer import NullCleaner
>>> from pyspark.sql import Row
>>>
>>> input_df = spark.createDataFrame([
>>>     Row(id=0, points=5),
>>>     Row(id=1, points=None),
>>>     Row(id=2, points=15),
>>> ])
>>> transformer = NullCleaner(
>>>     cleaning_definitions={
>>>         "points": {"default": 0},
>>>     },
>>>     column_to_log_cleansed_values="cleansed_values_null",
>>>     store_as_map=True,
>>> )
>>> output_df = transformer.transform(input_df)
>>> output_df.show()
+---+------+--------------------+
| id|points|cleansed_values_null|
+---+------+--------------------+
|  0|     5|                null|
|  1|     0|    [points -> null]|
|  2|    15|                null|
+---+------+--------------------+
>>> output_df.printSchema()
 |-- id: long (nullable = true)
 |-- points: long (nullable = true)
 |-- cleansed_values_null: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
- Parameters
cleaning_definitions (dict) – Dictionary containing column names and respective default values
column_to_log_cleansed_values (str, Defaults to None) – Defines a column in which the original (uncleansed) value will be stored in case of cleansing. If no column name is given, nothing will be logged.
store_as_map (bool, Defaults to False) – Specifies if the logged cleansed values should be stored in a column as pyspark.sql.types.MapType or as pyspark.sql.types.StructType with stringified values.
Note
The following cleaning_definitions attributes per column are mandatory:
- default (Column or any primitive Python value) - If a value gets cleansed, it is replaced with the provided default value.
- Returns
The transformed DataFrame
- Return type
DataFrame
- Raises
exceptions.ValueError – Null-based cleaning requires the field default. Default parameter is not specified for column with name: {column_name}
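The check behind this error can be sketched in plain Python. The helper name `validate_cleaning_definitions` is hypothetical and not part of spooq; only the error text is taken from the description above, and the actual implementation may differ.

```python
def validate_cleaning_definitions(cleaning_definitions):
    """Hypothetical helper: fail early if a column has no "default" entry."""
    for column_name, definition in cleaning_definitions.items():
        if "default" not in definition:
            raise ValueError(
                "Null-based cleaning requires the field default. "
                f"Default parameter is not specified for column with name: {column_name}"
            )


# A complete definition passes silently.
validate_cleaning_definitions({"points": {"default": 0}})

# A definition without "default" is rejected.
try:
    validate_cleaning_definitions({"points": {}})
except ValueError as error:
    message = str(error)
```

Validating at construction time means a misconfigured column is reported before any data is processed.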
- transform(input_df)[source]
Performs a transformation on a DataFrame.
- Parameters
input_df (DataFrame) – Input DataFrame
- Returns
Transformed DataFrame.
- Return type
DataFrame
Note
This method takes only the input DataFrame as a parameter. All other required parameters are defined when the Transformer object is initialized.
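The split between initialization and transformation can be illustrated without Spark. The following is a minimal sketch, assuming rows are plain dictionaries instead of a DataFrame; `SketchNullCleaner` is a hypothetical stand-in, not the spooq class.

```python
class SketchNullCleaner:
    """Hypothetical stand-in that works on lists of dicts instead of DataFrames."""

    def __init__(self, cleaning_definitions):
        # All configuration is fixed at construction time.
        self.cleaning_definitions = cleaning_definitions

    def transform(self, rows):
        # transform() receives only the data and returns a new list.
        cleaned = []
        for row in rows:
            new_row = dict(row)  # copy, so the input rows stay untouched
            for column, definition in self.cleaning_definitions.items():
                # In this sketch a missing key is treated like a null value.
                if new_row.get(column) is None:
                    new_row[column] = definition["default"]
            cleaned.append(new_row)
        return cleaned


transformer = SketchNullCleaner(cleaning_definitions={"points": {"default": 0}})
rows = [{"id": 0, "points": 5}, {"id": 1, "points": None}, {"id": 2, "points": 15}]
output_rows = transformer.transform(rows)
print([row["points"] for row in output_rows])  # [5, 0, 15]
```

Because the configuration lives on the object, the same transformer can be applied to several inputs without repeating the cleaning definitions.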
Base Class
This abstract class provides the functionality to log any cleansed values into a separate column that contains a struct with a sub column per cleansed column (according to the cleaning_definition). If a value was cleansed, the original value will be stored in its respective sub column. If a value was not cleansed, the sub column will be empty (None).
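As a rough illustration of the log layouts, the sketch below builds the log entry for a single row in plain Python. The function `build_cleansing_log` is hypothetical, and the map layout (only cleansed columns, `None` when nothing was cleansed) is inferred from the NullCleaner example output rather than from the implementation.

```python
def build_cleansing_log(original_row, cleaned_row, cleaning_definitions, store_as_map=False):
    """Hypothetical helper: describe which values of one row were cleansed."""
    cleansed = {
        column: original_row[column]
        for column in cleaning_definitions
        if original_row[column] != cleaned_row[column]
    }
    if store_as_map:
        # Map-like layout: only cleansed columns appear, values are stringified.
        if not cleansed:
            return None
        return {c: (None if v is None else str(v)) for c, v in cleansed.items()}
    # Struct-like layout: one sub column per configured column, None if untouched.
    return {column: cleansed.get(column) for column in cleaning_definitions}


definitions = {"points": {"default": 0}, "level": {"default": 1}}
original = {"id": 1, "points": -5, "level": 3}
cleaned = {"id": 1, "points": 0, "level": 3}  # assume some cleaner replaced -5

struct_log = build_cleansing_log(original, cleaned, definitions)
map_log = build_cleansing_log(original, cleaned, definitions, store_as_map=True)
untouched_log = build_cleansing_log(cleaned, cleaned, definitions, store_as_map=True)
```

Here `struct_log` keeps a slot for every configured column, while `map_log` lists only the column that actually changed.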
- class BaseCleaner(cleaning_definitions, column_to_log_cleansed_values, store_as_map=False, temporary_columns_prefix='1b75cdd2e2356a35486230c69cfac5493488a919')[source]