# ingestion
w
I'm running DataHub locally from the `getting-started` Docker container images, based on the Quickstart guide. I tried setting up a connection to our Redshift cluster and ran into this error:
```
2022-09-27 18:30:45.362431 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Starting execution for task with name=RUN_INGEST
2022-09-27 18:47:03.670827 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Caught exception EXECUTING task_id=fec3ab48-c33b-4403-abfc-f61720c609ae, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 182, in execute
    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 126, in _read_output_lines
    full_log_file.write(line)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task
    task_event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 188, in execute
    full_log_file.close()
OSError: [Errno 28] No space left on device
```
The ingestion recipe YAML looks something like this:
```yaml
source:
    type: redshift
    config:
        start_time: '2022-09-26 00:00:00Z'
        end_time: '2022-09-26 12:00:00Z'
        table_lineage_mode: mixed
        include_table_lineage: true
        database: insightsetl
        password: '${etl2_test_datahub_creds}'
        profiling:
            enabled: true
        host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
        stateful_ingestion:
            enabled: true
        username: datahub_ingestion
pipeline_name: 'urn:li:dataHubIngestionSource:93b5640d-8ed3-456e-89f9-0ec3def38733'
```
I'm not sure if it's a container issue, a config issue, or something else.
g
Looks like it’s running out of disk space in the execution container while attempting to write logs to disk.
The short-term solve would be to simply free up some space on your system / in your Docker config, and long-term we’ll start looking into some mechanisms for log rotation.
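A quick way to confirm whether the disk backing the containers is actually full is to check free space programmatically. A minimal sketch, assuming a Linux host where Docker keeps its data under `/var/lib/docker` (on macOS/Windows the containers live inside Docker Desktop's VM, so its disk-size limit in Settings is what matters instead):

```python
import shutil

# Check free space on the root filesystem. On Linux, Docker's images,
# containers, and volumes typically live under /var/lib/docker, which is
# usually on this filesystem unless mounted separately.
total, used, free = shutil.disk_usage("/")
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```

The usual CLI equivalents are `docker system df` to see what Docker is consuming and `docker system prune` to reclaim space from unused images and containers.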
w
Thanks Harshal - I'm pretty new to Docker - which container would need the additional space?
Thanks! Let me try that.
FWIW - I tried increasing the size and restarting Docker, which didn't seem to work 🤷 I did turn off `include_copy_lineage` and was finally able to get a successful ingestion run.
```yaml
source:
    type: redshift
    config:
        table_lineage_mode: stl_scan_based
        include_table_lineage: true
        include_copy_lineage: false
        database: insightsetl
        password: '${etl2test}'
        profiling:
            enabled: false
        host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
        stateful_ingestion:
            enabled: false
        username: datahub_ingestion
pipeline_name: 'urn:li:dataHubIngestionSource:ef3016df-5b79-48fa-be92-885b2eba0ff0'
```
This is a cluster in our Beta stage into which we copy some sample data from upstream stages, so there are lots of copies happening all the time.
That being said ... is there any easy way to see which tables have been updated?
g
When you ran it with `include_copy_lineage` enabled, did it produce any warnings or errors in the logs?
w
Kafka was saying records were too long.
It ended up crashing containers ... or something ... had to "hit play" again in Docker 🙂
I can try to recreate it if you'd like.
g
That makes sense. My current hypothesis here is that a single Redshift table is being populated by copies from a bunch of different S3 objects (most likely one copy command per hour/day), and we’re not collapsing that down to a logical path on our end.
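The kind of collapsing described above could look roughly like this. A heuristic sketch only (this is not DataHub's actual logic; the partition patterns are assumptions): map each S3 object key to a logical dataset path by wildcarding date-like and `key=value` partition segments, so thousands of hourly copies dedupe to one upstream.

```python
import re

def collapse_s3_path(path: str) -> str:
    """Collapse partition segments of an S3 object key into a logical
    dataset path. Heuristic sketch, not DataHub's implementation."""
    # Replace /YYYY/MM/DD[/HH]/ date partitions with a single wildcard.
    path = re.sub(r"/\d{4}/\d{2}/\d{2}(/\d{2})?/", "/*/", path)
    # Replace Hive-style partitions like day=26 with day=*.
    path = re.sub(r"(year|month|day|hour)=[^/]+", r"\1=*", path)
    return path
```

With this, `s3://bucket/events/2022/09/26/12/part-0000.gz` and its sibling for the next hour both collapse to `s3://bucket/events/*/part-0000.gz`, so repeated copies no longer produce one lineage edge per object.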
w
That matches from what I can recall of the logs
Tangentially - I'm trying to evaluate DataHub's usage of sqllineage (we use that same module on our side, but not at the same scale) and I want to get a sense of the correctness of the resulting lineage. It's just hard to do because this is a lightly used cluster (aside from the "table subscriptions" from upstream stages). Is there a way to quickly identify new lineage information that was generated from my most recent run? I stumbled upon one table, but there are about 3100 and I don't want to keep guess-and-checking 🙂
g
Here new means “lineage that wasn’t present in the last run, but showed up now” right?
That’d definitely be possible by subscribing to the MetadataChangeLog Kafka topic, but I’m just trying to think if there’s any easier way.
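A sketch of that subscription approach, under stated assumptions: the `kafka-python` client (any Kafka client works), DataHub's default topic name `MetadataChangeLog_Versioned_v1`, and a guess that "new lineage" shows up as upserts to the `upstreamLineage` aspect:

```python
import json

def is_lineage_change(mcl: dict) -> bool:
    """Heuristic: a MetadataChangeLog record that creates or upserts the
    upstreamLineage aspect of some entity."""
    return (
        mcl.get("aspectName") == "upstreamLineage"
        and mcl.get("changeType") in ("CREATE", "UPSERT")
    )

def watch_lineage(bootstrap: str = "localhost:9092") -> None:
    """Tail the MetadataChangeLog topic and print entities whose lineage
    changed. Requires a running broker; run during/after an ingestion."""
    from kafka import KafkaConsumer  # pip install kafka-python (assumption)

    consumer = KafkaConsumer(
        "MetadataChangeLog_Versioned_v1",  # DataHub's default topic name
        bootstrap_servers=bootstrap,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:
        if is_lineage_change(msg.value):
            print(msg.value.get("entityUrn"))
```

Calling `watch_lineage()` while an ingestion run is in flight would print the URN of each table whose lineage was written, which avoids guess-and-checking across all ~3100 tables.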
w
> Here new means “lineage that wasn’t present in the last run, but showed up now” right?
Correct