JSON Files
class JSONExtractor(input_path=None, base_path=None, partition=None)
Bases: spooq2.extractor.extractor.Extractor
The JSONExtractor class provides an API to extract data stored in JSON format, deserialize it into a PySpark DataFrame, and return it. Currently, only single-line JSON files are supported, stored either as text files or as SequenceFiles.
Examples
>>> from spooq2 import extractor as E
>>> extractor = E.JSONExtractor(input_path="tests/data/schema_v1/sequenceFiles")
>>> extractor.input_path == "tests/data/schema_v1/sequenceFiles" + "/*"
True
>>> extractor = E.JSONExtractor(
...     base_path="tests/data/schema_v1/sequenceFiles",
...     partition="20200201"
... )
>>> extractor.input_path == "tests/data/schema_v1/sequenceFiles" + "/20/02/01" + "/*"
True
Parameters:
- input_path (str) – The path from which the JSON files should be loaded (“/*” will be appended if omitted).
- base_path (str) – Spooq tries to infer the input_path from the base_path and the partition if the input_path is missing.
- partition (str or int) – Spooq tries to infer the input_path from the base_path and the partition if the input_path is missing. Only daily partitions in the form of “YYYYMMDD” are supported, e.g., “20200201” => <base_path> + “/20/02/01/*”.
Returns: The extracted data set as a PySpark DataFrame
Return type: pyspark.sql.DataFrame
Raises: AttributeError – Please define either input_path or base_path and partition
Warning
Currently only single-line JSON files stored as SequenceFiles or TextFiles are supported!
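To illustrate what "single-line JSON" means in this warning (one complete JSON document per line), here is a minimal sketch that uses only Python's standard library. It involves neither Spooq nor Spark; the sample records and the helper name `parse_json_lines` are made up for illustration.

```python
import json

# Hypothetical sample data: every line holds one complete JSON document.
single_line_json = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'


def parse_json_lines(text):
    """Parse text that contains one JSON document per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_json_lines(single_line_json)

# A pretty-printed document spans several lines, so no single line
# is valid JSON on its own and line-by-line parsing fails.
multi_line_json = '{\n  "id": 1\n}\n'
try:
    parse_json_lines(multi_line_json)
    multi_line_ok = True
except json.JSONDecodeError:
    multi_line_ok = False
```

Files written in the second, pretty-printed style would need to be converted to one document per line before they match the supported layout.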
Note
The init method checks which input parameters are provided and derives the final input_path from them accordingly.
- If input_path is not None:
  - Cleans input_path and returns it as the final input_path
- Elif base_path and partition are not None:
  - Cleans base_path, infers the sub path from the partition and returns the combined string as the final input_path
- Else:
  - Raises an AttributeError
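The three branches described in this note can be sketched in plain Python. This is not Spooq's actual implementation: `resolve_input_path` and `_clean` are hypothetical names, the cleaning rule is an assumption, and the two-digit year in the sub path is inferred from the documented example “/20/02/01”, which does not disambiguate it.

```python
from datetime import datetime


def _clean(path):
    # Assumed cleaning rule: drop a trailing "/*" and any trailing slashes,
    # then append "/*" exactly once.
    if path.endswith("/*"):
        path = path[:-2]
    return path.rstrip("/") + "/*"


def resolve_input_path(input_path=None, base_path=None, partition=None):
    """Hypothetical helper mirroring the three documented branches."""
    if input_path is not None:
        return _clean(input_path)
    if base_path is not None and partition is not None:
        # Only daily "YYYYMMDD" partitions are supported; str() lets ints through.
        date = datetime.strptime(str(partition), "%Y%m%d")
        # Assumption: two-digit year, as the example "/20/02/01" suggests.
        sub_path = date.strftime("%y/%m/%d")
        return _clean(base_path.rstrip("/") + "/" + sub_path)
    raise AttributeError(
        "Please define either input_path or base_path and partition"
    )
```

Called with `base_path="tests/data/schema_v1/sequenceFiles"` and `partition="20200201"`, the sketch yields the same string as the second doctest in the Examples section. Parsing the partition with `strptime` also rejects values that are not valid calendar dates.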
extract()
This is the public API method to be called for all classes of Extractors.
Returns: Complex PySpark DataFrame deserialized from the input JSON files
Return type: pyspark.sql.DataFrame