# ingestion
shy-lion-56425:
Any recommendations on setting include and exclude path_specs for s3?
source:
    type: s3
    config:
        path_specs:
        - include : "<s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz>"
        - exclude : "**/AWSLogs/057183463473/CloudTrail-Digest/**"
        aws_config:
            aws_access_key_id: "{aws_key}"
            aws_secret_access_key: "{aws_secret}"
            aws_region: us-east-1
        profiling:
            enabled: false
Error:
[2022-09-21 15:47:53,596] ERROR    {datahub.ingestion.run.pipeline:127} - 'include'
Traceback (most recent call last):
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 178, in __init__
    self.source: Source = source_class.create(
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 321, in create
    config = DataLakeSourceConfig.parse_obj(config_dict)
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1056, in pydantic.main.validate_model
  File "pydantic/fields.py", line 868, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 901, in pydantic.fields.ModelField._validate_sequence_like
  File "pydantic/fields.py", line 1067, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 857, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 1074, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 1121, in pydantic.fields.ModelField._apply_validators
  File "pydantic/class_validators.py", line 313, in pydantic.class_validators._generic_validator_basic.lambda12
  File "pydantic/main.py", line 704, in pydantic.main.BaseModel.validate
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1082, in pydantic.main.validate_model
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/aws/path_spec.py", line 104, in validate_path_spec
    if "**" in values["include"]:
KeyError: 'include'
[2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion
[2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion
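For anyone hitting the same KeyError: the traceback points at the PathSpec validator reading values["include"] for every path_specs entry, and in the recipe above path_specs is a list with two separate single-key items, one carrying only include and one carrying only exclude, so the second item has no include key at all. A minimal sketch with PyYAML, using shortened placeholder paths, shows the structure the validator actually receives:

import yaml

# same shape as the recipe above, with shortened placeholder paths
snippet = """
path_specs:
- include: "s3://my-bucket/AWSLogs/{table}/*.json.gz"
- exclude: "**/CloudTrail-Digest/**"
"""

for spec in yaml.safe_load(snippet)["path_specs"]:
    # first item has only an 'include' key; second item has only an 'exclude' key
    print(spec, "has include:", "include" in spec)

The second dict is what reaches validate_path_spec without an include, hence the KeyError.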
helpful-optician-78938:
Hi @shy-lion-56425, path_specs is an Optional[List[PathSpec]]. You need to provide it as a YAML list of objects. Try something like the following:
path_specs:
   - include: <value>
     exclude: <value>
   - include: <value>
     exclude: <value>
For valid values for each of these, see Path Spec.
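To see why the list-of-objects shape matters, here is a toy pydantic sketch. ToyPathSpec and ToyConfig are simplified stand-ins invented here for illustration, not DataHub's actual classes, but they mirror the Optional[List[PathSpec]] typing: every element of the list has to be a complete object with its own include.

from typing import List, Optional

from pydantic import BaseModel, ValidationError


class ToyPathSpec(BaseModel):
    # simplified stand-in for PathSpec: include is required, exclude is optional
    include: str
    exclude: Optional[List[str]] = None


class ToyConfig(BaseModel):
    path_specs: Optional[List[ToyPathSpec]] = None


# broken shape: include and exclude split into two separate list items
try:
    ToyConfig(path_specs=[
        {"include": "s3://my-bucket/{table}/*.json.gz"},
        {"exclude": ["**/CloudTrail-Digest/**"]},
    ])
except ValidationError as err:
    print(err)  # the second item is rejected because it has no 'include'

# corrected shape: one object per path spec, carrying both keys
config = ToyConfig(path_specs=[
    {"include": "s3://my-bucket/{table}/*.json.gz",
     "exclude": ["**/CloudTrail-Digest/**"]},
])
print(config.path_specs)

With DataHub's real models, the same mistake surfaces as the KeyError: 'include' in the traceback above.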
shy-lion-56425:
@helpful-optician-78938 thanks for the suggestion. I was able to get it to work with the following:
- include : "s3_path"
  exclude : 
    - "s3_exclude_path"
exclude expects a list, so you have to use the "-" (dash) notation under each exclude string.
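To make the list requirement concrete: the "-" entries are exactly what turn exclude into a YAML list rather than a single string. A quick check with PyYAML (placeholder paths again):

import yaml

# exclude written with the "-" notation parses as a Python list
with_dash = yaml.safe_load("""
- include: "s3_path"
  exclude:
    - "s3_exclude_path"
""")

# exclude written as a bare value parses as a plain string instead
without_dash = yaml.safe_load("""
- include: "s3_path"
  exclude: "s3_exclude_path"
""")

print(type(with_dash[0]["exclude"]))     # <class 'list'>
print(type(without_dash[0]["exclude"]))  # <class 'str'>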
@helpful-optician-78938 if I have a .gz file that contains JSON but doesn't have a .json suffix, is there an option to get the schema processed? Is this currently supported?
helpful-optician-78938:
I'll look into this and get back to you soon @shy-lion-56425.
shy-lion-56425:
@helpful-optician-78938 just following up to see if you've found anything. It looks like the dataset is being processed, but no field names are extracted.
helpful-optician-78938:
cc: @gray-shoe-75895
shy-lion-56425:
Best I can tell, the s3 source.py is defaulting to open instead of smart_open, which causes the .gz file to fail in json.JsonInferrer.infer_schema. Here's a quick example:
import gzip

from smart_open import open as smart_open

from datahub.ingestion.source.schema_inference import json

file_path = "example_compressed_json.gz"
inferrer = json.JsonInferrer()

# gzip forced: the inferrer sees decompressed JSON and returns fields
with gzip.open(file_path, "rb") as f:
    print(inferrer.infer_schema(f))

# smart_open: also decompresses the .gz transparently and returns fields
with smart_open(file_path, mode="rb") as f:
    print(inferrer.infer_schema(f))

# standard open: the inferrer gets raw gzip bytes and returns no fields
with open(file_path, mode="rb") as f:
    print(inferrer.infer_schema(f))
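The difference comes down to which bytes infer_schema actually sees: gzip.open and smart_open hand it decompressed JSON, while the built-in open hands it the raw gzip stream, which starts with the gzip magic bytes and is not parseable as JSON. A quick way to confirm that, assuming the same example_compressed_json.gz file:

import gzip

from smart_open import open as smart_open

file_path = "example_compressed_json.gz"

# built-in open: raw gzip bytes, starting with the gzip magic number b'\x1f\x8b'
with open(file_path, "rb") as f:
    print(f.read(2))

# gzip.open: decompressed bytes, i.e. the start of the JSON payload
with gzip.open(file_path, "rb") as f:
    print(f.read(20))

# smart_open: decompresses .gz transparently as well, so the same JSON bytes
with smart_open(file_path, mode="rb") as f:
    print(f.read(20))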
gray-shoe-75895:
You should be able to use the default_extension: json config under the path_spec to make it use the JSON schema inference. For your reference, the exact code we use is here: https://github.com/datahub-project/datahub/blob/228f3b50ea26e21d133bb8bcbebc581ac7[…]9d/metadata-ingestion/src/datahub/ingestion/source/s3/source.py
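For anyone following along, default_extension sits on the individual path_spec entry. A sketch of what that fragment might look like (hypothetical placeholder path, dumped to YAML here only to show the placement; see the Path Spec docs for the exact accepted fields):

import yaml

# hypothetical path_spec entry: files matched by the include are treated as JSON
# for schema inference even though they end in .gz rather than .json
fragment = {
    "path_specs": [
        {
            "include": "s3://my-bucket/AWSLogs/{table}/*.gz",
            "default_extension": "json",
        }
    ]
}
print(yaml.safe_dump(fragment, sort_keys=False))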
shy-lion-56425:
@gray-shoe-75895 thanks for reaching out. I'm currently using the default_extension: json config and still getting the same results as the example above.
Also logged an issue here with more details: https://github.com/datahub-project/datahub/issues/6181
gray-shoe-75895:
Thanks, I responded on the GitHub issue.
shy-lion-56425:
Just closed the issue, but for anyone else: the problem was with my version of ujson. Here's the fix that worked for me:
pip install --upgrade ujson
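If anyone else runs into this, it may be worth confirming which ujson version is actually installed before and after the upgrade; one way to check from Python:

from importlib.metadata import version

# print the installed ujson version to confirm the upgrade took effect
print("ujson", version("ujson"))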