# ingestion
shy-lion-56425:
Any recommendations on setting include and exclude path_specs for s3?
source:
    type: s3
    config:
        path_specs:
        - include : "<s3://cseo-global-cloudtrail/AWSLogs/057183463473/{table}/{partition[0]}/{partition[1]}/{partition[2]}/{partition[3]}/*_CloudTrail-Digest_*.json.gz>"
        - exclude : "**/AWSLogs/057183463473/CloudTrail-Digest/**"
        aws_config:
            aws_access_key_id: "{aws_key}"
            aws_secret_access_key: "{aws_secret}"
            aws_region: us-east-1
        profiling:
            enabled: false
Error:
[2022-09-21 15:47:53,596] ERROR    {datahub.ingestion.run.pipeline:127} - 'include'
Traceback (most recent call last):
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 178, in __init__
    self.source: Source = source_class.create(
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 321, in create
    config = DataLakeSourceConfig.parse_obj(config_dict)
  File "pydantic/main.py", line 521, in pydantic.main.BaseModel.parse_obj
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1056, in pydantic.main.validate_model
  File "pydantic/fields.py", line 868, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 901, in pydantic.fields.ModelField._validate_sequence_like
  File "pydantic/fields.py", line 1067, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 857, in pydantic.fields.ModelField.validate
  File "pydantic/fields.py", line 1074, in pydantic.fields.ModelField._validate_singleton
  File "pydantic/fields.py", line 1121, in pydantic.fields.ModelField._apply_validators
  File "pydantic/class_validators.py", line 313, in pydantic.class_validators._generic_validator_basic.lambda12
  File "pydantic/main.py", line 704, in pydantic.main.BaseModel.validate
  File "pydantic/main.py", line 339, in pydantic.main.BaseModel.__init__
  File "pydantic/main.py", line 1082, in pydantic.main.validate_model
  File "/Users/raithels/opt/anaconda3/lib/python3.9/site-packages/datahub/ingestion/source/aws/path_spec.py", line 104, in validate_path_spec
    if "**" in values["include"]:
KeyError: 'include'
[2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion
[2022-09-21 15:47:53,598] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion
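For anyone hitting the same KeyError: the traceback points at the PathSpec validator reading values["include"] for every path_specs entry, and in the recipe above path_specs is a list with two separate single-key items, one carrying only include and one carrying only exclude, so the second item has no include key at all. A minimal sketch with PyYAML, using shortened placeholder paths, shows the structure the validator actually receives:

import yaml

# same shape as the recipe above, with shortened placeholder paths
snippet = """
path_specs:
- include: "s3://my-bucket/AWSLogs/{table}/*.json.gz"
- exclude: "**/CloudTrail-Digest/**"
"""

for spec in yaml.safe_load(snippet)["path_specs"]:
    # first item has only an 'include' key; second item has only an 'exclude' key
    print(spec, "has include:", "include" in spec)

The second dict is what reaches validate_path_spec without an include, hence the KeyError.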
helpful-optician-78938:
Hi @shy-lion-56425, path_specs is an Optional[List[PathSpec]]. You need to provide it as a YAML list of objects. Try something like the following:
path_specs:
   - include: <value>
     exclude: <value>
   - include: <value>
     exclude: <value>
For valid values for each of these, see Path Spec.
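To see why the list-of-objects shape matters, here is a toy pydantic sketch. ToyPathSpec and ToyConfig are simplified stand-ins invented here for illustration, not DataHub's actual classes, but they mirror the Optional[List[PathSpec]] typing: every element of the list has to be a complete object with its own include.

from typing import List, Optional

from pydantic import BaseModel, ValidationError


class ToyPathSpec(BaseModel):
    # simplified stand-in for PathSpec: include is required, exclude is optional
    include: str
    exclude: Optional[List[str]] = None


class ToyConfig(BaseModel):
    path_specs: Optional[List[ToyPathSpec]] = None


# broken shape: include and exclude split into two separate list items
try:
    ToyConfig(path_specs=[
        {"include": "s3://my-bucket/{table}/*.json.gz"},
        {"exclude": ["**/CloudTrail-Digest/**"]},
    ])
except ValidationError as err:
    print(err)  # the second item is rejected because it has no 'include'

# corrected shape: one object per path spec, carrying both keys
config = ToyConfig(path_specs=[
    {"include": "s3://my-bucket/{table}/*.json.gz",
     "exclude": ["**/CloudTrail-Digest/**"]},
])
print(config.path_specs)

With DataHub's real models, the same mistake surfaces as the KeyError: 'include' in the traceback above.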
shy-lion-56425:
@helpful-optician-78938 thanks for the suggestion. I was able to get it to work with the following:
- include : "s3_path"
  exclude : 
    - "s3_exclude_path"
exclude expects a list, so you have to use the "-" (dash) notation under each exclude string.
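To make the list requirement concrete: the "-" entries are exactly what turn exclude into a YAML list rather than a single string. A quick check with PyYAML (placeholder paths again):

import yaml

# exclude written with the "-" notation parses as a Python list
with_dash = yaml.safe_load("""
- include: "s3_path"
  exclude:
    - "s3_exclude_path"
""")

# exclude written as a bare value parses as a plain string instead
without_dash = yaml.safe_load("""
- include: "s3_path"
  exclude: "s3_exclude_path"
""")

print(type(with_dash[0]["exclude"]))     # <class 'list'>
print(type(without_dash[0]["exclude"]))  # <class 'str'>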
@helpful-optician-78938 if I have a .gz file that contains JSON but doesn't have a .json suffix, is there an option to get the schema processed? Is this currently supported?
helpful-optician-78938:
I'll look into this and get back to you soon @shy-lion-56425.
shy-lion-56425:
@helpful-optician-78938 just following up to see if you've found anything. It looks like the dataset is being processed, but no field names are extracted.
helpful-optician-78938:
cc: @gray-shoe-75895
shy-lion-56425:
Best I can tell, the s3 source.py is defaulting to open instead of smart_open, which causes the .gz file to fail in json.JsonInferrer.infer_schema. Here's a quick example:
import gzip

from smart_open import open as smart_open

from datahub.ingestion.source.schema_inference import json

file_path = "example_compressed_json.gz"
inferrer = json.JsonInferrer()

# gzip forced: the inferrer sees decompressed JSON and returns fields
with gzip.open(file_path, "rb") as f:
    print(inferrer.infer_schema(f))

# smart_open: also decompresses the .gz transparently and returns fields
with smart_open(file_path, mode="rb") as f:
    print(inferrer.infer_schema(f))

# standard open: the inferrer gets raw gzip bytes and returns no fields
with open(file_path, mode="rb") as f:
    print(inferrer.infer_schema(f))
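The difference comes down to which bytes infer_schema actually sees: gzip.open and smart_open hand it decompressed JSON, while the built-in open hands it the raw gzip stream, which starts with the gzip magic bytes and is not parseable as JSON. A quick way to confirm that, assuming the same example_compressed_json.gz file:

import gzip

from smart_open import open as smart_open

file_path = "example_compressed_json.gz"

# built-in open: raw gzip bytes, starting with the gzip magic number b'\x1f\x8b'
with open(file_path, "rb") as f:
    print(f.read(2))

# gzip.open: decompressed bytes, i.e. the start of the JSON payload
with gzip.open(file_path, "rb") as f:
    print(f.read(20))

# smart_open: decompresses .gz transparently as well, so the same JSON bytes
with smart_open(file_path, mode="rb") as f:
    print(f.read(20))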
gray-shoe-75895:
You should be able to use the default_extension: json config under the path_spec to make it use the JSON schema inference. For your reference, the exact code we use is here: https://github.com/datahub-project/datahub/blob/228f3b50ea26e21d133bb8bcbebc581ac7[…]9d/metadata-ingestion/src/datahub/ingestion/source/s3/source.py
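For anyone following along, default_extension sits on the individual path_spec entry. A sketch of what that fragment might look like (hypothetical placeholder path, dumped to YAML here only to show the placement; see the Path Spec docs for the exact accepted fields):

import yaml

# hypothetical path_spec entry: files matched by the include are treated as JSON
# for schema inference even though they end in .gz rather than .json
fragment = {
    "path_specs": [
        {
            "include": "s3://my-bucket/AWSLogs/{table}/*.gz",
            "default_extension": "json",
        }
    ]
}
print(yaml.safe_dump(fragment, sort_keys=False))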
shy-lion-56425:
@gray-shoe-75895 thanks for reaching out. I'm currently using the default_extension: json config and still getting the same results as the example above.
Also logged an issue here with more details: https://github.com/datahub-project/datahub/issues/6181
gray-shoe-75895:
Thanks, I responded on the GitHub issue.
shy-lion-56425:
Just closed the issue, but for anyone else: the problem was with my version of ujson. Here's the fix that worked for me:
pip install --upgrade ujson
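If anyone else runs into this, it may be worth confirming which ujson version is actually installed before and after the upgrade; one way to check from Python:

from importlib.metadata import version

# print the installed ujson version to confirm the upgrade took effect
print("ujson", version("ujson"))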