shy-lion-56425
09/21/2022, 8:50 PM
source:
  type: s3
  config:
    path_specs:
      - include : "<s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz>"
      - exclude : "**/AWSLogs/057183463473/CloudTrail-Digest/**"
    aws_config:
      aws_access_key_id: "{aws_key}"
      aws_secret_access_key: "{aws_secret}"
      aws_region: us-east-1
    profiling:
      enabled: false
Error:
[2022-09-21 15:47:53,596] ERROR {datahub.ingestion.run.pipeline:127} - 'include'
Traceback (most recent call last):
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 178, in __init__
    self.source: Source = source_class.create(
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 321, in create
    config = DataLakeSourceConfig.parse_obj(config_dict)
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1056, in pydantic.main.validate_model
  File "pydantic/fields.py", line 868, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 901, in pydantic.fields.ModelField._validate_sequence_like
  File "pydantic/fields.py", line 1067, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 857, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 1074, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 1121, in pydantic.fields.ModelField._apply_validators
  File "pydantic/class_validators.py", line 313, in pydantic.class_validators._generic_validator_basic.lambda12
  File "pydantic/main.py", line 704, in pydantic.main.BaseModel.validate
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1082, in pydantic.main.validate_model
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/aws/path_spec.py", line 104, in validate_path_spec
    if "**" in values["include"]:
KeyError: 'include'
[2022-09-21 15:47:53,598] INFO {datahub.cli.ingest_cli:119} - Starting metadata ingestion
[2022-09-21 15:47:53,598] INFO {datahub.cli.ingest_cli:137} - Finished metadata ingestion
helpful-optician-78938
09/21/2022, 10:07 PM
Optional[List[PathSpec]]. You need to provide that as a YAML list of objects.
Try something like the following.
path_specs:
  - include: <value>
    exclude: <value>
  - include: <value>
    exclude: <value>
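To illustrate the list-of-objects shape above, here is a minimal Python sketch of what the two YAML layouts look like once parsed. It is only an illustration, not DataHub's own validation code: the bucket name, the patterns, and the helper `items_missing_include` are hypothetical. The assumption is that writing `- include` and `- exclude` as separate list items yields two objects, the second of which has no `include` key, which would be consistent with the `KeyError: 'include'` in the traceback.

```python
from typing import Any, Dict, List

# Hypothetical parsed form of a recipe where include and exclude are
# written as separate list items (each "-" starts a new object).
separate_items: List[Dict[str, Any]] = [
    {"include": "s3://example-bucket/{table}/*.json"},
    {"exclude": "**/digest/**"},
]

# Hypothetical parsed form of the suggested shape: one object per path spec.
one_object: List[Dict[str, Any]] = [
    {"include": "s3://example-bucket/{table}/*.json", "exclude": "**/digest/**"},
]


def items_missing_include(path_specs: List[Dict[str, Any]]) -> List[int]:
    """Return the indexes of list items that have no 'include' key."""
    return [i for i, spec in enumerate(path_specs) if "include" not in spec]


print(items_missing_include(separate_items))  # [1]
print(items_missing_include(one_object))      # []
```

Under that assumption, any code that reads `values["include"]` for every list item would fail on the second item of the first layout and succeed on the second layout.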
For valid values for each of these, see Path Spec.
shy-lion-56425
09/22/2022, 1:19 PM
- include : "s3_path"
  exclude :
    - "s3_exclude_path"
exclude expects a list, so each exclude string has to go under it using the - notation.
shy-lion-56425
09/22/2022, 5:22 PM
helpful-optician-78938
09/22/2022, 5:38 PM
shy-lion-56425
10/11/2022, 7:56 PM
helpful-optician-78938
10/11/2022, 8:12 PM
shy-lion-56425
10/11/2022, 8:18 PM
json.JsonInferrer.infer_schema
here's a quick example:
import gzip

from smart_open import open as smart_open

from datahub.ingestion.source.schema_inference import json

file_path = 'example_compressed_json.gz'
test = json.JsonInferrer()

# gzip forced (returns fields)
file = gzip.open(file_path, 'rb')
test.infer_schema(file) == []  # False: fields are returned

# smart open (returns fields)
file = smart_open(file_path, mode='rb')
test.infer_schema(file) == []  # False: fields are returned

# standard open (returns no fields)
file = open(file_path, mode='rb')
test.infer_schema(file) == []  # True: no fields are returned
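The difference between the three open calls above can be reproduced with the standard library alone. This is a rough sketch, not DataHub's inference code: the sample record and the helper `try_parse` are made up for illustration, and the idea that the inferrer finds no fields because it receives compressed bytes is an assumption. It shows that the built-in `open` hands back the raw gzip bytes, which are not parseable as JSON, while `gzip.open` hands back the decompressed JSON.

```python
import gzip
import json
import os
import tempfile

# Write a small gzip-compressed JSON file to a temporary directory.
record = {"eventName": "ExampleEvent", "count": 1}  # hypothetical sample data
tmp_dir = tempfile.mkdtemp()
file_path = os.path.join(tmp_dir, "example_compressed_json.gz")
with gzip.open(file_path, "wb") as f:
    f.write(json.dumps(record).encode("utf-8"))


def try_parse(data: bytes):
    """Return the parsed JSON value, or None if the bytes are not valid JSON."""
    try:
        return json.loads(data)
    except ValueError:  # covers JSONDecodeError and UnicodeDecodeError
        return None


# Built-in open() returns the compressed bytes as stored on disk.
with open(file_path, "rb") as f:
    raw = f.read()
starts_with_gzip_magic = raw[:2] == b"\x1f\x8b"

# gzip.open() decompresses transparently while reading.
with gzip.open(file_path, "rb") as f:
    decompressed = f.read()

print(starts_with_gzip_magic)   # True
print(try_parse(raw))           # None
print(try_parse(decompressed))  # {'eventName': 'ExampleEvent', 'count': 1}
```

smart_open picks a decompressor from the file extension, which would explain why it behaves like the forced gzip case for a `.gz` path.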
gray-shoe-75895
10/11/2022, 10:32 PM
default_extension: json config under the path_spec to make it use the JSON schema inference. For your reference, the exact code we use is here: https://github.com/datahub-project/datahub/blob/228f3b50ea26e21d133bb8bcbebc581ac7[…]9d/metadata-ingestion/src/datahub/ingestion/source/s3/source.py
shy-lion-56425
10/12/2022, 1:43 PM
shy-lion-56425
10/12/2022, 3:08 PM
gray-shoe-75895
10/13/2022, 3:06 AM
shy-lion-56425
10/13/2022, 1:35 PM
pip install --upgrade ujson