Threshold-based Cleaner
class ThresholdCleaner(thresholds={})
Bases: spooq2.transformer.transformer.Transformer
Sets outliers within a DataFrame to a default value. Takes a dictionary that defines the valid value range and the default value for each column to be cleaned.
Example
>>> transformer = ThresholdCleaner(
>>>     thresholds={
>>>         "created_at": {
>>>             "min": 0,
>>>             "max": 1580737513,
>>>             "default": None
>>>         },
>>>         "size_cm": {
>>>             "min": 70,
>>>             "max": 250,
>>>             "default": None
>>>         },
>>>     }
>>> )
Parameters: thresholds (dict) – Dictionary containing column names and respective valid ranges
Returns: The transformed DataFrame
Return type: pyspark.sql.DataFrame
Raises: exceptions.ValueError – Threshold-based cleaning only supports Numeric, Date and Timestamp Types! Column with name: {col_name} and type of: {col_type} was provided
Warning
Only Numeric, TimestampType, and DateType data types are supported!
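The cleaning rule described above can be sketched in plain Python, without Spark. This is not the library's implementation: the function name clean_by_thresholds is hypothetical, rows are modelled as dictionaries instead of a DataFrame, the bounds are assumed to be inclusive, and missing values are assumed to be left untouched.

```python
def clean_by_thresholds(rows, thresholds):
    """Return copies of rows with out-of-range values replaced by the column default.

    Hypothetical helper for illustration only; not part of spooq2.
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        for column, rule in thresholds.items():
            value = new_row.get(column)
            if value is None:
                continue  # assumption: missing values are left untouched
            # assumption: "min" and "max" are both inclusive bounds
            if not (rule["min"] <= value <= rule["max"]):
                new_row[column] = rule["default"]
        cleaned.append(new_row)
    return cleaned


thresholds = {
    "created_at": {"min": 0, "max": 1580737513, "default": None},
    "size_cm": {"min": 70, "max": 250, "default": None},
}
rows = [
    {"created_at": 1500000000, "size_cm": 180},  # both values in range
    {"created_at": -5, "size_cm": 300},          # both values out of range
]
result = clean_by_thresholds(rows, thresholds)
```

With the thresholds from the example, the first row is returned unchanged and both values of the second row are replaced by their default of None.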
transform(input_df)
Performs a transformation on a DataFrame.
Parameters: input_df (pyspark.sql.DataFrame) – Input DataFrame
Returns: Transformed DataFrame.
Return type: pyspark.sql.DataFrame
Note
This method takes only the input DataFrame as a parameter. All other required parameters are defined when the Transformer object is initialized.
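One likely benefit of this convention is that differently configured transformers can be applied through the same single-argument call. The sketch below illustrates the pattern with plain Python lists of dictionaries. The classes RangeCleaner and ColumnSelector and the function run_pipeline are hypothetical stand-ins, not spooq2 classes.

```python
class RangeCleaner:
    """Hypothetical stand-in: all configuration is fixed in __init__."""

    def __init__(self, column, minimum, maximum, default=None):
        self.column = column
        self.minimum = minimum
        self.maximum = maximum
        self.default = default

    def transform(self, rows):
        # Only the data is passed in; the range comes from the constructor.
        return [
            {
                **row,
                self.column: row[self.column]
                if self.minimum <= row[self.column] <= self.maximum
                else self.default,
            }
            for row in rows
        ]


class ColumnSelector:
    """Hypothetical stand-in with a different configuration but the same interface."""

    def __init__(self, columns):
        self.columns = columns

    def transform(self, rows):
        return [{name: row[name] for name in self.columns} for row in rows]


def run_pipeline(rows, transformers):
    """Apply each transformer in order through its uniform transform(rows) call."""
    for transformer in transformers:
        rows = transformer.transform(rows)
    return rows


pipeline = [RangeCleaner("size_cm", 70, 250), ColumnSelector(["size_cm"])]
output = run_pipeline(
    [{"id": 1, "size_cm": 180}, {"id": 2, "size_cm": 999}],
    pipeline,
)
```

Because every object exposes the same transform signature, run_pipeline needs no knowledge of how each step was configured.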