Enumeration-based Cleaner
- class EnumCleaner(cleaning_definitions={}, column_to_log_cleansed_values=None, store_as_map=False)
Cleanses a DataFrame based on lists of allowed or disallowed values.
Examples
>>> from spooq.transformer import EnumCleaner
>>>
>>> transformer = EnumCleaner(
...     cleaning_definitions={
...         "status": {
...             "elements": ["active", "inactive"],
...         },
...         "version": {
...             "elements": ["", "None", "none", "null", "NULL"],
...             "mode": "disallow",
...             "default": None,
...         },
...     }
... )
>>> from spooq.transformer import EnumCleaner
>>> from pyspark.sql import Row
>>>
>>> input_df = spark.createDataFrame([
...     Row(a="stay", b="positive"),
...     Row(a="stay", b="negative"),
...     Row(a="stay", b="positive"),
... ])
>>> transformer = EnumCleaner(
...     cleaning_definitions={
...         "b": {
...             "elements": ["positive"],
...             "mode": "allow",
...         }
...     },
...     column_to_log_cleansed_values="cleansed_values_enum",
...     store_as_map=True,
... )
>>> output_df = transformer.transform(input_df)
>>> output_df.show()
+----+--------+--------------------+
|   a|       b|cleansed_values_enum|
+----+--------+--------------------+
|stay|positive|                  []|
|stay|    null|     [b -> negative]|
|stay|positive|                  []|
+----+--------+--------------------+
>>> output_df.printSchema()
root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- cleansed_values_enum: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
- Parameters
cleaning_definitions (dict) – Dictionary containing column names and respective cleansing rules.
column_to_log_cleansed_values (str, Defaults to None) – Defines a column in which the original (uncleansed) value will be stored in case of cleansing. If no column name is given, nothing will be logged.
store_as_map (bool, Defaults to False) – Specifies if the logged cleansed values should be stored in a column as pyspark.sql.types.MapType with stringified values or as pyspark.sql.types.StructType with the original respective data types.
Note
The following cleansing rule attributes are supported per column:
- elements, mandatory - list
A list of elements which will be used to allow or reject (based on mode) values from the input DataFrame.
- mode, allow|disallow, defaults to ‘allow’ - str
“allow” will set all values which are NOT in the list (ignoring NULL) to the default value. “disallow” will set all values which ARE in the list (ignoring NULL) to the default value.
- default, defaults to None - Column or any primitive Python value
If a value gets cleansed it gets replaced with the provided default value.
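The interplay of elements, mode and default described in the note can be illustrated with a small pure-Python model that works on a single value. This is only a sketch, not Spooq's implementation: clean_value is a hypothetical helper, and it assumes that "ignoring NULL" means a None input passes through unchanged.

```python
# Pure-Python model of one cleansing rule. This is NOT Spooq's code;
# clean_value is a hypothetical helper used for illustration only.
def clean_value(value, elements, mode="allow", default=None):
    """Return the cleansed value for a single cell."""
    if mode not in ("allow", "disallow"):
        raise ValueError("Only 'allow' and 'disallow' are supported modes.")
    if value is None:
        # Assumption: NULL inputs are ignored by both modes and stay NULL.
        return None
    in_list = value in elements
    # "allow" keeps listed values, "disallow" keeps unlisted values.
    keep = in_list if mode == "allow" else not in_list
    return value if keep else default


# The two rules from the first example above.
status_rule = {"elements": ["active", "inactive"]}
version_rule = {
    "elements": ["", "None", "none", "null", "NULL"],
    "mode": "disallow",
    "default": None,
}

print(clean_value("active", **status_rule))    # active
print(clean_value("deleted", **status_rule))   # None
print(clean_value("NULL", **version_rule))     # None
print(clean_value("1.2.3", **version_rule))    # 1.2.3
```

In the real transformer the same decision is made per column on a Spark DataFrame rather than per Python value.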
- Returns
The transformed DataFrame
- Return type
DataFrame
- Raises
exceptions.ValueError – Enumeration-based cleaning requires a non-empty list of elements per cleaning rule! Spooq did not find such a list for column: {column_name}
exceptions.ValueError – Only the following modes are supported by EnumCleaner: ‘allow’ and ‘disallow’.
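The two error conditions listed under Raises could be checked along the following lines. This is a hedged sketch: validate_cleaning_definitions is a hypothetical function, and the documentation does not say whether Spooq performs these checks at construction time or during transform.

```python
# Sketch of the documented validation rules. validate_cleaning_definitions
# is a hypothetical name and not part of the Spooq API.
def validate_cleaning_definitions(cleaning_definitions):
    for column_name, rule in cleaning_definitions.items():
        elements = rule.get("elements")
        # Each rule needs a non-empty list of elements.
        if not isinstance(elements, list) or not elements:
            raise ValueError(
                "Enumeration-based cleaning requires a non-empty list of "
                "elements per cleaning rule! Spooq did not find such a list "
                f"for column: {column_name}"
            )
        # The mode is optional and defaults to "allow".
        if rule.get("mode", "allow") not in ("allow", "disallow"):
            raise ValueError(
                "Only the following modes are supported by EnumCleaner: "
                "'allow' and 'disallow'."
            )


# A valid definition passes silently.
validate_cleaning_definitions({"status": {"elements": ["active", "inactive"]}})

# A definition without elements is rejected.
try:
    validate_cleaning_definitions({"version": {"mode": "disallow"}})
except ValueError as error:
    print(error)
```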
Warning
None values are explicitly ignored as input values because F.lit(None).isin(["elem1", "elem2"]) returns neither True nor False but None. To replace NULL values, use the transformer spooq.transformer.NullCleaner.
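The three-valued logic behind this warning can be modelled in plain Python, with None standing in for SQL NULL. The helpers sql_isin, case_when and disallow are hypothetical names for this sketch. They mimic the behaviour of a membership test and a CASE WHEN expression and are not Spark or Spooq functions.

```python
# Plain-Python model of SQL's three-valued logic; None stands in for NULL.
# All three helpers are hypothetical and exist only for this illustration.
def sql_isin(value, elements):
    """Membership test that yields None (unknown) for a NULL input."""
    if value is None:
        return None
    return value in elements


def case_when(condition, then_value, otherwise_value):
    """CASE WHEN model: a NULL condition falls through to the ELSE branch."""
    return then_value if condition is True else otherwise_value


def disallow(value, elements, default):
    """Replace listed values with the default, keep everything else."""
    return case_when(sql_isin(value, elements), default, value)


print(sql_isin(None, ["elem1", "elem2"]))              # None
print(disallow("none", ["none", "null"], "unknown"))   # unknown
print(disallow(None, ["none", "null"], "unknown"))     # None
```

Because the membership test is unknown for a NULL input, the rule never fires for it and the NULL survives, which is why a dedicated NULL-handling transformer is recommended for that case.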
- transform(input_df)
Performs a transformation on a DataFrame.
- Parameters
input_df (DataFrame) – Input DataFrame
- Returns
Transformed DataFrame.
- Return type
DataFrame
Note
This method takes only the input DataFrame as a parameter. All other required parameters are defined when the transformer object is initialized.
Base Class
This abstract class provides the functionality to log any cleansed values into a separate column that contains a struct with a sub column per cleansed column (according to the cleaning_definition). If a value was cleansed, the original value will be stored in its respective sub column. If a value was not cleansed, the sub column will be empty (None).
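The logging behaviour described above can be sketched for a single row in plain Python. This is an assumption-laden model, not the BaseCleaner code: log_cleansed_values is a hypothetical helper, rows are plain dicts, and the map variant omits entries for columns that were not cleansed, as the empty map in the EnumCleaner example output suggests.

```python
# Row-level model of the cleansed-value log. log_cleansed_values is a
# hypothetical helper; the real class works on Spark columns instead.
def log_cleansed_values(original_row, cleansed_row, cleaned_columns,
                        store_as_map=False):
    log = {}
    for column in cleaned_columns:
        was_cleansed = original_row[column] != cleansed_row[column]
        if store_as_map:
            # Map variant: stringified originals, only for cleansed columns.
            if was_cleansed:
                log[column] = str(original_row[column])
        else:
            # Struct variant: one entry per column, None if nothing changed.
            log[column] = original_row[column] if was_cleansed else None
    return log


original = {"a": "stay", "n": 99}
cleansed = {"a": "stay", "n": None}

print(log_cleansed_values(original, cleansed, ["n"]))
# {'n': 99}
print(log_cleansed_values(original, cleansed, ["n"], store_as_map=True))
# {'n': '99'}
print(log_cleansed_values(original, original, ["n"]))
# {'n': None}
print(log_cleansed_values(original, original, ["n"], store_as_map=True))
# {}
```

The struct variant keeps the original data type of each logged value, while the map variant trades that for a uniform string representation.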
- class BaseCleaner(cleaning_definitions, column_to_log_cleansed_values, store_as_map=False, temporary_columns_prefix='1b75cdd2e2356a35486230c69cfac5493488a919')