Threshold-based Cleaner

class ThresholdCleaner(thresholds={})[source]

Bases: spooq2.transformer.transformer.Transformer

Sets outliers within a DataFrame to a default value. Takes a dictionary that maps each column to be cleaned to its valid value range and default.

Example

>>> transformer = ThresholdCleaner(
...     thresholds={
...         "created_at": {
...             "min": 0,
...             "max": 1580737513,
...             "default": None
...         },
...         "size_cm": {
...             "min": 70,
...             "max": 250,
...             "default": None
...         },
...     }
... )
Parameters: thresholds (dict) – Dictionary containing column names and their respective valid ranges
Returns: The transformed DataFrame
Return type: pyspark.sql.DataFrame
Raises: exceptions.ValueError – Threshold-based cleaning only supports Numeric, Date and Timestamp Types! Column with name: {col_name} and type of: {col_type} was provided

Warning

Only numeric types, TimestampType, and DateType are supported!
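
The replacement rule described above can be illustrated with a minimal plain-Python sketch. It is not the spooq2 implementation and does not use Spark: the helper name clean_rows, the list-of-dicts input, the inclusive bounds, and the handling of missing values are all assumptions made for illustration only.

```python
from datetime import date
from numbers import Number

# Hypothetical stand-in for the supported column types
# (datetime is a subclass of date, so timestamps are covered as well).
SUPPORTED_TYPES = (Number, date)


def clean_rows(rows, thresholds):
    """Return copies of the rows with out-of-range values replaced by the default."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows stay unchanged
        for col_name, value_range in thresholds.items():
            value = new_row.get(col_name)
            if value is None:
                # Assumption: missing values are left untouched.
                continue
            if not isinstance(value, SUPPORTED_TYPES):
                raise ValueError(
                    "Threshold-based cleaning only supports Numeric, Date and "
                    "Timestamp Types! Column with name: {} and type of: {} "
                    "was provided".format(col_name, type(value).__name__)
                )
            # Assumption: both bounds are inclusive.
            if not (value_range["min"] <= value <= value_range["max"]):
                new_row[col_name] = value_range.get("default")
        cleaned.append(new_row)
    return cleaned


thresholds = {
    "created_at": {"min": 0, "max": 1580737513, "default": None},
    "size_cm": {"min": 70, "max": 250, "default": None},
}
rows = [
    {"created_at": 1500000000, "size_cm": 180},  # both values in range
    {"created_at": -5, "size_cm": 300},          # both values out of range
]
cleaned = clean_rows(rows, thresholds)
```

In this sketch the first row passes through unchanged, while both values of the second row fall outside their ranges and are replaced by the configured default.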

transform(input_df)[source]

Performs a transformation on a DataFrame.

Parameters: input_df (pyspark.sql.DataFrame) – Input DataFrame
Returns: Transformed DataFrame.
Return type: pyspark.sql.DataFrame

Note

This method takes only the input DataFrame as a parameter. All other required parameters are defined when the Transformer object is initialized.