JSON Files

class JSONExtractor(input_path=None, base_path=None, partition=None)

Bases: spooq2.extractor.extractor.Extractor

The JSONExtractor class provides an API to extract data stored in JSON format, deserialize it into a PySpark DataFrame, and return it. Currently, only single-line JSON files are supported, stored either as textFile or sequenceFile.

Examples

>>> from spooq2 import extractor as E

>>> extractor = E.JSONExtractor(input_path="tests/data/schema_v1/sequenceFiles")
>>> extractor.input_path == "tests/data/schema_v1/sequenceFiles" + "/*"
True

>>> extractor = E.JSONExtractor(
...     base_path="tests/data/schema_v1/sequenceFiles",
...     partition="20200201"
... )
>>> extractor.input_path == "tests/data/schema_v1/sequenceFiles" + "/20/02/01" + "/*"
True

Parameters:	
    input_path (`str`) – The path from which the JSON files should be loaded (“/*” will be appended if omitted).
    base_path (`str`) – Spooq tries to infer the `input_path` from the `base_path` and the `partition` if the `input_path` is missing.
    partition (`str` or `int`) – Spooq tries to infer the `input_path` from the `base_path` and the `partition` if the `input_path` is missing. Only daily partitions in the form of “YYYYMMDD” are supported, e.g., “20200201” => <base_path> + “/20/02/01/*”.
Returns:	The extracted data set as a PySpark DataFrame
Return type:	`pyspark.sql.DataFrame`
Raises:	`AttributeError` – Please define either `input_path` or `base_path` and `partition`

Warning

Currently only single-line JSON files stored as SequenceFiles or TextFiles are supported!
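To illustrate what "single-line JSON" means here, the following sketch uses only the Python standard library. It is not how Spooq or Spark reads the files; the helper name `parse_single_line_json` and the sample records are hypothetical and exist only to show the expected layout: one complete JSON document per line.

```python
import json


def parse_single_line_json(text):
    """Hypothetical helper: parse text that holds one JSON document per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Supported layout: every line is a complete JSON document.
supported = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
records = parse_single_line_json(supported)

# Unsupported layout: one document pretty-printed across several lines.
# Parsing line by line fails because "{" alone is not valid JSON.
unsupported = '{\n  "id": 1,\n  "name": "a"\n}\n'
try:
    parse_single_line_json(unsupported)
    multi_line_parsed = True
except json.JSONDecodeError:
    multi_line_parsed = False
```

Files that pretty-print a record across several lines would need to be flattened to one record per line before extraction.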

Note

The init method checks which input parameters are provided and derives the final input_path from them accordingly.

If input_path is not None: Cleans input_path and returns it as the final input_path.
Elif base_path and partition are not None: Cleans base_path, infers the sub path from the partition and returns the combined string as the final input_path.
Else: Raises an AttributeError.
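The three branches above can be sketched as a standalone function. This is a minimal illustration, not Spooq's actual implementation: the names `derive_input_path` and `_clean` are hypothetical, "cleaning" is assumed to mean removing a trailing “/*” or “/”, the two-digit year is assumed to be the last two digits of “YYYY” (the documented example “20200201” does not disambiguate this), and the `ValueError` for malformed partitions is an addition of this sketch.

```python
def _clean(path):
    # Assumption: "cleaning" strips a trailing "/*" and any trailing "/".
    if path.endswith("/*"):
        path = path[:-2]
    return path.rstrip("/")


def derive_input_path(input_path=None, base_path=None, partition=None):
    """Hypothetical sketch of the input_path derivation described above."""
    if input_path is not None:
        # Branch 1: an explicit input_path wins.
        return _clean(input_path) + "/*"

    if base_path is not None and partition is not None:
        # Branch 2: build "<base_path>/YY/MM/DD/*" from a "YYYYMMDD" partition.
        day = str(partition)
        if len(day) != 8 or not day.isdigit():
            # Not part of the documented behavior; added for this sketch only.
            raise ValueError("partition must have the form YYYYMMDD")
        # Assumption: YY is the last two digits of the four-digit year.
        sub_path = "/".join([day[2:4], day[4:6], day[6:8]])
        return _clean(base_path) + "/" + sub_path + "/*"

    # Branch 3: neither an input_path nor a base_path/partition pair was given.
    raise AttributeError(
        "Please define either input_path or base_path and partition"
    )
```

Called with the values from the examples at the top of this section, the sketch yields the same paths: “tests/data/schema_v1/sequenceFiles/*” for the explicit input_path, and “tests/data/schema_v1/sequenceFiles/20/02/01/*” for the base_path combined with partition “20200201”.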

extract()

This is the public API method to be called for all Extractor classes.

Returns:	Complex PySpark DataFrame deserialized from the input JSON Files
Return type:	`pyspark.sql.DataFrame`