# ingestion
Hello all, I am trying to ingest Delta Lake table metadata into DataHub using the new delta-lake ingestion source. Can someone let me know how I can use an AWS role to configure S3 permissions? I don't want to use an AWS access key and secret key. I am using a Lambda function to call the DataHub REST endpoint, so ideally I want to use the Lambda's execution role to access S3. An example showing how aws_role should be provided would be helpful.
Are you running the DataHub instance in AWS?
Yes, I am running DataHub on ECS.
@green-lion-58215 just ignore the parameters that are not required. AWS credentials will be detected automatically via boto3's default credential chain. 🙂
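You can sanity-check what the default chain resolves to from inside the Lambda with a quick snippet, roughly:
```python
import boto3

# Quick check of boto3's default credential chain (env vars, ECS task
# role, instance profile, Lambda execution role, ...).
creds = boto3.Session().get_credentials()
print(creds is not None)  # True if the chain can resolve credentials
```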
We are planning to execute the ingestion CLI command from a Lambda function that has access to DataHub. The Lambda execution role also has access to the Delta table S3 locations. In that case, how do we pass the Lambda execution role to the recipe so it can fetch the Delta tables from S3? I see that the role is expected to be provided in a generic dict format; can you provide an example for this? Something like the sketch below is what I'd guess, but I'm not sure.
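```python
# A sketch of a recipe using an assumed role instead of static keys; the
# role ARN and ExternalId are placeholders. If I read DataHub's
# AwsConnectionConfig right, aws_role can be a role ARN string or a list
# of dicts with RoleArn and an optional ExternalId -- worth confirming
# against the version you run.
config = {
    "type": "delta-lake",
    "config": {
        "base_path": "s3://my-bucket/path/to/tables/",
        "s3": {
            "aws_config": {
                "aws_region": "us-east-1",
                "aws_role": [
                    {
                        "RoleArn": "arn:aws:iam::123456789012:role/my-ingestion-role",
                        "ExternalId": "my-external-id",  # optional
                    }
                ],
            }
        },
    },
}
```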
I am seeing this error during execution:
```
[ERROR] TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'
Traceback (most recent call last):
  File "/var/task/app.py", line 131, in lambda_handler
    raise e
  File "/var/task/app.py", line 111, in lambda_handler
    pipeline.run()
  File "/var/task/datahub/ingestion/run/pipeline.py", line 263, in run
    for wu in itertools.islice(
  File "/var/task/datahub/ingestion/source/delta_lake/source.py", line 263, in get_workunits
    for wu in self.process_folder(
  File "/var/task/datahub/ingestion/source/delta_lake/source.py", line 228, in process_folder
    delta_table = read_delta_table(path, self.source_config)
  File "/var/task/datahub/ingestion/source/delta_lake/delta_lake_utils.py", line 23, in read_delta_table
    delta_table = DeltaTable(path, storage_options=opts)
  File "/var/task/deltalake/table.py", line 90, in __init__
    self._table = RawDeltaTable(
```
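If I read the traceback right, the Rust binding behind deltalake rejects None values in storage_options (every value must be a string), so unset credentials propagated into that dict could trip this. A minimal reproduction sketch, with a placeholder path:
```python
from deltalake import DeltaTable

# Every storage_options value must be a string; a None value (e.g. an
# unset aws_access_key_id forwarded into the options dict) raises the
# same TypeError as in the traceback above.
DeltaTable(
    "s3://my-bucket/path/to/table/",  # placeholder path
    storage_options={"AWS_ACCESS_KEY_ID": None},
)
```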
For context, I am able to read the S3 files from the Lambda code using boto3.
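Roughly along these lines (bucket and key are placeholders):
```python
import boto3

# Plain boto3 access works because it picks up the Lambda execution role
# through the default credential chain.
s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="my-bucket",
    Key="path/to/_delta_log/00000000000000000000.json",
)
print(obj["Body"].read(200))
```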
And this is the config I used:
```python
config = {
    "type": "delta-lake",
    "config": {
        "base_path": "s3://coursera-data-engineering-spo/databricks/domain/layer/",
        "s3": {
            "aws_config": {
                "aws_region": "us-east-1"
            }
        }
    }
}
```
Also, I see in this code that it always uses the access key and secret key for fetching Delta tables? https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/delta_lake/delta_lake_utils.py
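One workaround sketch, given that delta_lake_utils.py forwards key-based credentials into storage_options: resolve the Lambda role's temporary credentials with boto3 and pass them to the recipe explicitly. The aws_session_token field, and whether the source forwards it, are assumptions to confirm against the DataHub version in use:
```python
import boto3

# Workaround sketch: materialize the Lambda execution role's temporary
# credentials via boto3 and feed them to the recipe explicitly, so no
# None values reach storage_options. Field names mirror aws_config and
# should be verified against the installed DataHub version.
frozen = boto3.Session().get_credentials().get_frozen_credentials()

config = {
    "type": "delta-lake",
    "config": {
        "base_path": "s3://coursera-data-engineering-spo/databricks/domain/layer/",
        "s3": {
            "aws_config": {
                "aws_region": "us-east-1",
                "aws_access_key_id": frozen.access_key,
                "aws_secret_access_key": frozen.secret_key,
                "aws_session_token": frozen.token,  # needed with role-based temp creds
            }
        },
    },
}
```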