# ingestion
w
I'm running DataHub locally from the `getting-started` Docker container images, based on the Quickstart guide. I tried setting up a connection to our Redshift cluster and ran into this error:
```
2022-09-27 18:30:45.362431 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Starting execution for task with name=RUN_INGEST
2022-09-27 18:47:03.670827 [exec_id=fec3ab48-c33b-4403-abfc-f61720c609ae] INFO: Caught exception EXECUTING task_id=fec3ab48-c33b-4403-abfc-f61720c609ae, name=RUN_INGEST, stacktrace=Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 182, in execute
    await tasks.gather(_read_output_lines(), _report_progress(), _process_waiter())
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 126, in _read_output_lines
    full_log_file.write(line)
OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

OSError: [Errno 28] No space left on device

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task
    task_event_loop.run_until_complete(task_future)
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 188, in execute
    full_log_file.close()
OSError: [Errno 28] No space left on device
```
The ingestion recipe YAML looks something like this:
```yaml
source:
    type: redshift
    config:
        start_time: '2022-09-26 00:00:00Z'
        end_time: '2022-09-26 12:00:00Z'
        table_lineage_mode: mixed
        include_table_lineage: true
        database: insightsetl
        password: '${etl2_test_datahub_creds}'
        profiling:
            enabled: true
        host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
        stateful_ingestion:
            enabled: true
        username: datahub_ingestion
pipeline_name: 'urn:li:dataHubIngestionSource:93b5640d-8ed3-456e-89f9-0ec3def38733'
```
I'm not sure if it's a container issue, a config issue, or something else.
g
Looks like it’s running out of disk space in the execution container while attempting to write logs to disk.
The short-term solve would be to simply free up some space on your system / in your Docker config, and long-term we’ll start looking into some mechanisms for log rotation.
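A quick way to confirm whether the disk backing the containers is actually full is to check free space programmatically. A minimal sketch, assuming a Linux host where Docker keeps its data under `/var/lib/docker` (on macOS/Windows the containers live inside Docker Desktop's VM, so its disk-size limit in Settings is what matters instead):

```python
import shutil

# Check free space on the root filesystem. On Linux, Docker's images,
# containers, and volumes typically live under /var/lib/docker, which is
# usually on this filesystem unless mounted separately.
total, used, free = shutil.disk_usage("/")
print(f"free: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")
```

The usual CLI equivalents are `docker system df` to see what Docker is consuming and `docker system prune` to reclaim space from unused images and containers.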
w
Thanks Harshal - I'm pretty new to Docker - which container would need the additional space?
Thanks! Let me try that.
FWIW - I tried increasing the size and restarting Docker, which didn't seem to work 🤷 I did turn off `include_copy_lineage` and was finally able to get a successful ingestion run.
```yaml
source:
    type: redshift
    config:
        table_lineage_mode: stl_scan_based
        include_table_lineage: true
        include_copy_lineage: false
        database: insightsetl
        password: '${etl2test}'
        profiling:
            enabled: false
        host_port: 'pi-redshift-etl-2-test.ccvpgkqogsrc.us-east-1.redshift.amazonaws.com:8192'
        stateful_ingestion:
            enabled: false
        username: datahub_ingestion
pipeline_name: 'urn:li:dataHubIngestionSource:ef3016df-5b79-48fa-be92-885b2eba0ff0'
```
This is a cluster in our Beta stage into which we copy some sample data from upstream stages, so there are lots of copies happening all the time.
That being said ... is there any easy way to see which tables have been updated?
g
When you ran it with `include_copy_lineage` enabled, did it produce any warnings or errors in the logs?
w
Kafka was saying records were too long.
It ended up crashing containers ... or something ... had to "hit play" again in Docker 🙂
I can try to recreate it if you'd like.
g
That makes sense. My current hypothesis here is that a single Redshift table is being populated by copies from a bunch of different S3 objects (most likely one copy command per hour/day), and we’re not collapsing that down to a logical path on our end.
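The kind of collapsing described above could look roughly like this. A heuristic sketch only (this is not DataHub's actual logic; the partition patterns are assumptions): map each S3 object key to a logical dataset path by wildcarding date-like and `key=value` partition segments, so thousands of hourly copies dedupe to one upstream.

```python
import re

def collapse_s3_path(path: str) -> str:
    """Collapse partition segments of an S3 object key into a logical
    dataset path. Heuristic sketch, not DataHub's implementation."""
    # Replace /YYYY/MM/DD[/HH]/ date partitions with a single wildcard.
    path = re.sub(r"/\d{4}/\d{2}/\d{2}(/\d{2})?/", "/*/", path)
    # Replace Hive-style partitions like day=26 with day=*.
    path = re.sub(r"(year|month|day|hour)=[^/]+", r"\1=*", path)
    return path
```

With this, `s3://bucket/events/2022/09/26/12/part-0000.gz` and its sibling for the next hour both collapse to `s3://bucket/events/*/part-0000.gz`, so repeated copies no longer produce one lineage edge per object.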
w
That matches from what I can recall of the logs
Tangentially - I'm trying to evaluate DataHub's usage of sqllineage (we use that same module on our side, but not at the same scale) and I want to get a sense of the correctness of the resulting lineage. It's just hard to do because this is a lightly used cluster (aside from the "table subscriptions" from upstream stages). Is there a way to quickly identify new lineage information that was generated from my most recent run? I stumbled upon one table, but there are about 3100 and I don't want to keep guess-and-checking 🙂
g
Here new means “lineage that wasn’t present in the last run, but showed up now” right?
That’d definitely be possible by subscribing to the MetadataChangeLog Kafka topic, but I’m just trying to think if there’s any easier way.
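A sketch of that subscription approach, under stated assumptions: the `kafka-python` client (any Kafka client works), DataHub's default topic name `MetadataChangeLog_Versioned_v1`, and a guess that "new lineage" shows up as upserts to the `upstreamLineage` aspect:

```python
import json

def is_lineage_change(mcl: dict) -> bool:
    """Heuristic: a MetadataChangeLog record that creates or upserts the
    upstreamLineage aspect of some entity."""
    return (
        mcl.get("aspectName") == "upstreamLineage"
        and mcl.get("changeType") in ("CREATE", "UPSERT")
    )

def watch_lineage(bootstrap: str = "localhost:9092") -> None:
    """Tail the MetadataChangeLog topic and print entities whose lineage
    changed. Requires a running broker; run during/after an ingestion."""
    from kafka import KafkaConsumer  # pip install kafka-python (assumption)

    consumer = KafkaConsumer(
        "MetadataChangeLog_Versioned_v1",  # DataHub's default topic name
        bootstrap_servers=bootstrap,
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for msg in consumer:
        if is_lineage_change(msg.value):
            print(msg.value.get("entityUrn"))
```

Calling `watch_lineage()` while an ingestion run is in flight would print the URN of each table whose lineage was written, which avoids guess-and-checking across all ~3100 tables.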
w
> Here new means “lineage that wasn’t present in the last run, but showed up now” right?
Correct