# troubleshoot
red-vr-34382
hi, i’m trying to ingest from S3 (through the UI) and the job keeps failing.
little-megabyte-1074
Hey @red-vr-34382! Gentle reminder to please follow our Slack Guidelines & make use of threads when posting large blocks of code/stack traces - it’s a HUGE help for us to keep track of open questions across our various support channels :teamwork:
red-vr-34382
oops, sorry @little-megabyte-1074
here are the logs:
```
[2022-07-29 15:29:02,103] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.41
[2022-07-29 15:29:02,185] INFO     {datahub.ingestion.run.pipeline:160} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-datahub-gms:8080
[2022-07-29 15:29:03,491] ERROR    {logger:26} - Please set env variable SPARK_VERSION
JAVA_HOME is not set
[2022-07-29 15:29:04,334] ERROR    {datahub.ingestion.run.pipeline:126} - Java gateway process exited before sending its port number
[2022-07-29 15:29:04,335] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion
[2022-07-29 15:29:04,335] INFO     {datahub.cli.ingest_cli:133} - Finished metadata pipeline

Failed to configure source (s3) due to Java gateway process exited before sending its port number

2022-07-29 15:29:06.772563 [exec_id=bb106a8d-7b27-43a6-a5b0-1fe8ab710863] INFO: Failed to execute 'datahub ingest'
2022-07-29 15:29:06.773591 [exec_id=bb106a8d-7b27-43a6-a5b0-1fe8ab710863] INFO: Caught exception EXECUTING task_id=bb106a8d-7b27-43a6-a5b0-1fe8ab710863, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
```
Execution finished with errors.
and my yaml is:
```yaml
source:
    type: s3
    config:
        path_specs:
            - include: 's3://...'
        aws_config:
            aws_access_key_id: ...
            aws_secret_access_key: ...
            aws_region: us-east-2
        profiling:
            enabled: true
```
any ideas what the problem could be?
(the `...` have actual values)
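for context, the two ERROR lines in the log are the classic pyspark symptom of a missing JVM: py4j tries to launch a Java process and dies before the gateway can report its port. A minimal sketch that reproduces the same failure, assuming pyspark is installed on a machine with no `java` on the PATH (the session builder call is just an illustration, not part of the DataHub recipe):
```python
# Minimal reproduction of the error above. Assumes pyspark is installed
# but no JVM is available (no JAVA_HOME, no java on the PATH), which is
# the situation inside the managed ingestion container.
from pyspark.sql import SparkSession

try:
    spark = SparkSession.builder.master("local[1]").getOrCreate()
except Exception as e:
    # Prints the same py4j message seen in the ingestion log:
    # "Java gateway process exited before sending its port number"
    print(e)
```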
c
@red-vr-34382 Only a subset of sources is supported by managed (UI-based) ingestion at this point. Unfortunately, the container that runs the ingestion doesn’t currently have Spark installed, and Spark is a requirement for S3 profiling - that’s why pyspark’s Java gateway exits before reporting its port. Please set up Spark and execute the recipe from outside the UI. I’m taking a note to improve the documentation around this.
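for example, here’s a minimal sketch of running the same recipe from a host where Spark and a JDK are installed - the file name `recipe.yml` and the `JAVA_HOME` path are placeholders, and the `SPARK_VERSION` value should match your local Spark install:
```python
# Sketch: run the recipe via the DataHub CLI from outside the managed
# container. SPARK_VERSION and JAVA_HOME are the two variables the failing
# log complained about; the values below are placeholders for your setup.
import os
import subprocess

env = dict(os.environ)
env.setdefault("SPARK_VERSION", "3.0")                       # match your Spark install
env.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk")  # placeholder path

# `datahub ingest -c <recipe>` is the standard CLI entry point; recipe.yml
# is the YAML recipe posted above, saved locally.
subprocess.run(["datahub", "ingest", "-c", "recipe.yml"], env=env, check=True)
```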
red-vr-34382
ahhh got it, thank you