# troubleshoot
red-vr-34382
hi, i’m trying to ingest from S3 (through the UI) and the job keeps failing.
little-megabyte-1074
Hey @red-vr-34382! Gentle reminder to please follow our Slack Guidelines & make use of threads when posting large blocks of code/stack traces - it’s a HUGE help for us to keep track of open questions across our various support channels :teamwork:
red-vr-34382
oops, sorry @little-megabyte-1074
here are the logs:
```
[2022-07-29 15:29:02,103] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.41
[2022-07-29 15:29:02,185] INFO     {datahub.ingestion.run.pipeline:160} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-datahub-gms:8080
[2022-07-29 15:29:03,491] ERROR    {logger:26} - Please set env variable SPARK_VERSION
JAVA_HOME is not set
[2022-07-29 15:29:04,334] ERROR    {datahub.ingestion.run.pipeline:126} - Java gateway process exited before sending its port number
[2022-07-29 15:29:04,335] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion
[2022-07-29 15:29:04,335] INFO     {datahub.cli.ingest_cli:133} - Finished metadata pipeline

Failed to configure source (s3) due to Java gateway process exited before sending its port number

2022-07-29 15:29:06.772563 [exec_id=bb106a8d-7b27-43a6-a5b0-1fe8ab710863] INFO: Failed to execute 'datahub ingest'
2022-07-29 15:29:06.773591 [exec_id=bb106a8d-7b27-43a6-a5b0-1fe8ab710863] INFO: Caught exception EXECUTING task_id=bb106a8d-7b27-43a6-a5b0-1fe8ab710863, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 121, in execute_task
    self.event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete
    return f.result()
  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result
    raise self._exception
  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step
    result = coro.send(None)
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 115, in execute
    raise TaskError("Failed to execute 'datahub ingest'")
acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'
```
Execution finished with errors.
and my yaml is:
```yaml
source:
    type: s3
    config:
        path_specs:
            - include: 's3://...'
        aws_config:
            aws_access_key_id: ...
            aws_secret_access_key: ...
            aws_region: us-east-2
        profiling:
            enabled: true
```
any ideas what the problem could be?
(the `...` have actual values)
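for context, the two ERROR lines in the log are the classic pyspark symptom of a missing JVM: py4j tries to launch a Java process and dies before the gateway can report its port. A minimal sketch that reproduces the same failure, assuming pyspark is installed on a machine with no `java` on the PATH (the session builder call is just an illustration, not part of the DataHub recipe):
```python
# Minimal reproduction of the error above. Assumes pyspark is installed
# but no JVM is available (no JAVA_HOME, no java on the PATH), which is
# the situation inside the managed ingestion container.
from pyspark.sql import SparkSession

try:
    spark = SparkSession.builder.master("local[1]").getOrCreate()
except Exception as e:
    # Prints the same py4j message seen in the ingestion log:
    # "Java gateway process exited before sending its port number"
    print(e)
```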
c
@red-vr-34382 Only a subset of sources is supported by managed (UI-based) ingestion at this point. Unfortunately, the container that runs the ingestion doesn’t currently have Spark installed, and Spark is a requirement for S3 profiling - that’s why pyspark’s Java gateway exits before reporting its port. Please set up Spark and execute the recipe from outside the UI. I’m taking a note to improve the documentation around this.
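for example, here’s a minimal sketch of running the same recipe from a host where Spark and a JDK are installed - the file name `recipe.yml` and the `JAVA_HOME` path are placeholders, and the `SPARK_VERSION` value should match your local Spark install:
```python
# Sketch: run the recipe via the DataHub CLI from outside the managed
# container. SPARK_VERSION and JAVA_HOME are the two variables the failing
# log complained about; the values below are placeholders for your setup.
import os
import subprocess

env = dict(os.environ)
env.setdefault("SPARK_VERSION", "3.0")                       # match your Spark install
env.setdefault("JAVA_HOME", "/usr/lib/jvm/java-11-openjdk")  # placeholder path

# `datahub ingest -c <recipe>` is the standard CLI entry point; recipe.yml
# is the YAML recipe posted above, saved locally.
subprocess.run(["datahub", "ingest", "-c", "recipe.yml"], env=env, check=True)
```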
red-vr-34382
ahhh got it, thank you