# ingestion
Hi all. I've recently set up DataHub on GCP via https://datahubproject.io/docs/deploy/gcp/. I've been able to ingest BigQuery data, but haven't been able to get the S3 data lake source to work. Here's the example YAML:
source:
    type: s3
    config:
        profiling:
            enabled: false
        path_specs:
            -
                include: 's3://MY_EXAMPLE_BUCKET/AWSLogs/0123456789/CloudTrail/us-east-1/2022/08/23/*.*'
                enable_compression: true
        aws_config:
            aws_access_key_id: '${AWS_ACCESS_KEY_ID_CLOUDTRAIL}'
            aws_region: us-east-1
            aws_secret_access_key: '${AWS_SECRET_ACCESS_KEY_CLOUDTRAIL}'
However, I get the following error:
~~~~ Execution Summary ~~~~

RUN_INGEST - {'errors': [],
 'exec_id': '9e0f190f-05fd-407c-bdb9-16cebaed1d0c',
 'infos': ['2022-09-20 16:24:16.151405 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Starting execution for task with name=RUN_INGEST',
           '2022-09-20 16:24:18.213813 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: stdout=Elapsed seconds = 0\n'
           '  --report-to TEXT                Provide an output file to produce a\n'
           'This version of datahub supports report-to functionality\n'
           'datahub --debug ingest run -c /tmp/datahub/ingest/9e0f190f-05fd-407c-bdb9-16cebaed1d0c/recipe.yml --report-to '
           '/tmp/datahub/ingest/9e0f190f-05fd-407c-bdb9-16cebaed1d0c/ingestion_report.json\n'
           '[2022-09-20 16:24:17,736] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43.2\n'
           '[2022-09-20 16:24:17,769] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
           'to talk to <http://datahub-datahub-gms:8080>\n'
           "[2022-09-20 16:24:17,770] ERROR    {datahub.ingestion.run.pipeline:127} - s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
           'Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 85, in _ensure_not_lazy\n'
           '    plugin_class = import_path(path)\n'
           '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path\n'
           '    item = importlib.import_module(module_name)\n'
           '  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module\n'
           '    return _bootstrap._gcd_import(name[level:], package, level)\n'
           '  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import\n'
           '  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load\n'
           '  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked\n'
           '  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked\n'
           '  File "<frozen importlib._bootstrap_external>", line 850, in exec_module\n'
           '  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed\n'
           '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/s3/__init__.py", line 1, in <module>\n'
           '    from datahub.ingestion.source.s3.source import S3Source\n'
           '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 10, in <module>\n'
           '    import pydeequ\n'
           "ModuleNotFoundError: No module named 'pydeequ'\n"
           '\n'
           'The above exception was the direct cause of the following exception:\n'
           '\n'
           'Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 172, in __init__\n'
           '    source_class = source_registry.get(source_type)\n'
           '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 127, in get\n'
           '    raise ConfigurationError(\n'
           "datahub.configuration.common.ConfigurationError: s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
           '[2022-09-20 16:24:17,773] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion\n'
           '[2022-09-20 16:24:17,774] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion\n'
           "[2022-09-20 16:24:17,919] ERROR    {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'. Run with "
           '--debug to get full trace\n'
           '[2022-09-20 16:24:17,920] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.43.2 at '
           '/usr/local/lib/python3.9/site-packages/datahub/__init__.py\n',
           "2022-09-20 16:24:18.214118 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Failed to execute 'datahub ingest'",
           '2022-09-20 16:24:18.214380 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Caught exception EXECUTING '
           'task_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
           '    self.event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
           '    return f.result()\n'
           '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
           '    raise self._exception\n'
           '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
           '    result = coro.send(None)\n'
           '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
Execution finished with errors.
Hi @shy-lion-56425, have you tried running pip install 'acryl-datahub[s3]' in the venv that you are trying to run the datahub ingest command from?
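For reference, a minimal sketch of that install-and-verify step. The version pin is an assumption taken from the CLI version shown in the log above (0.8.43.2); adjust it to match your deployment so the plugin and the CLI stay in sync:

```shell
# Install the S3 ingestion plugin (and its pydeequ dependency) into the
# same environment that runs `datahub ingest`. Pinning the extra to the
# CLI version avoids a mismatched plugin/CLI pair.
pip install 'acryl-datahub[s3]==0.8.43.2'

# List all ingestion plugins and their enabled/disabled status; `s3`
# should no longer show as disabled after the install.
datahub check plugins
```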
Hey @helpful-optician-78938, I'm relatively new to Kubernetes. Which container do I need to run the pip install on? The output above was from running the ingestion in the UI.
cc: @bulky-soccer-26729 ^^
@shy-lion-56425 I think you have to install it in the datahub-action container. That's where datahub ingest runs.
That is correct. That is the pod running the ingestion; any missing Python libs need to be installed on that one.
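A sketch of doing that from the command line. The namespace and label selector here are assumptions; run kubectl get pods first to find the exact actions pod name in your Helm release:

```shell
# Find the actions pod (label selector and namespace are assumptions --
# adjust to match what `kubectl get pods -A` shows for your deployment).
ACTIONS_POD=$(kubectl get pods -n datahub \
  -l app.kubernetes.io/name=datahub-actions \
  -o jsonpath='{.items[0].metadata.name}')

# Install the S3 plugin inside the running actions container.
kubectl exec -n datahub "$ACTIONS_POD" -- \
  pip install 'acryl-datahub[s3]'
```

Note that a pip install into a running pod does not survive a restart; for a durable fix, bake the extra dependency into a custom actions image (or an image with the s3 plugin preinstalled) and point your Helm values at it.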