# ingestion
m
Good morning, so I was trying to ingest metadata from Kafka using the following recipe:
```yaml
source:
    type: kafka
    config:
        platform_instance: <platform_instance>
        connection:
            consumer_config:
                security.protocol: SASL_PLAINTEXT
                sasl.username: <user>
                sasl.mechanism: PLAIN
                sasl.password: <password>
            bootstrap: 'broker1:9092'
            schema_registry_url: 'http://schema-registry:8081'
```
Then I got the following error:
```
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 98, in _read_output_lines
    line_bytes = await ingest_process.stdout.readline()
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
    raise ValueError(e.args[0])
ValueError: Separator is not found, and chunk exceed the limit
```
Note that this recipe worked in previous versions (the current version is v0.8.44). Thanks in advance!
g
Looks like one of the log lines was more than 64kb and overflowed our reporting buffer
I’ll bump up the log line length limit
m
But it doesn't affect the ingestion, does it?
g
If you run ingestion from the CLI, it should work fine. Our “executor” system starts a subprocess that runs `datahub ingest`, then captures the log lines and sends them back to the UI. In this case, the line is so long that the executor fails to read it and crashes, and hence the overall ingestion task fails as well
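For context, this line-by-line reading runs into asyncio's default 64 KB `StreamReader` limit. Below is a minimal, illustrative sketch (not the actual executor code; the command-line arguments and recipe path are assumptions) of how reading subprocess output one line at a time fails on oversized lines, and how the limit can be raised:

```python
import asyncio

# Illustrative sketch only: asyncio's StreamReader defaults to a 64 KiB limit,
# and readline() raises ValueError("Separator is not found, and chunk exceed
# the limit") when a single line exceeds it. Passing a larger `limit` when
# spawning the subprocess allows longer lines to be read.
async def run_ingest_with_larger_buffer(limit_bytes: int = 1024 * 1024):
    proc = await asyncio.create_subprocess_exec(
        "datahub", "ingest", "-c", "recipe.yml",  # hypothetical recipe path
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
        limit=limit_bytes,  # raises the per-line buffer from the 64 KiB default
    )
    while True:
        line = await proc.stdout.readline()  # fails here if a line exceeds the limit
        if not line:
            break
        print(line.decode(errors="replace"), end="")
    await proc.wait()

# asyncio.run(run_ingest_with_larger_buffer())
```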
a
I just ran into this problem as well. In addition to bumping up the buffer size, should the log capture code also break long log lines into chunks? That would make it a little more robust; otherwise, ingestion could break again if a log line exceeds the new limit you set.
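For illustration, the chunking idea suggested above could look roughly like this (a hypothetical helper, not part of datahub-actions):

```python
# Hypothetical sketch: split any log line longer than the buffer limit into
# smaller pieces before forwarding, so a single oversized line can never
# overflow the reader again.
def chunk_log_line(line: str, max_len: int = 64 * 1024) -> list[str]:
    """Break a long log line into pieces no longer than max_len characters."""
    if len(line) <= max_len:
        return [line]
    return [line[i:i + max_len] for i in range(0, len(line), max_len)]

# Example: a 200,000-character debug line becomes four chunks of at most 64 KiB.
chunks = chunk_log_line("x" * 200_000)
assert all(len(c) <= 64 * 1024 for c in chunks)
```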
@gray-shoe-75895 any update on this? I just ran into this issue again when I tried to debug a bigquery UI ingestion. I enabled the debug logs for the ingestion.
g
Yes, the fix for this will be included in the next datahub-actions release
a
awesome, when is the next release?
g
We just cut a new release for datahub-actions v0.0.8 - it’s still building the docker containers (https://github.com/acryldata/datahub-actions/actions), but please test it out once those are built and pushed and let me know how it goes
The fix was twofold: (1) increasing the buffer size from 64 KB to 1 MB, and (2) truncating long log lines when sending/displaying them in the UI. The truncation was necessary because GMS rejects payloads that are over 4 MB in size, but the full logs can still be found in the actions container at
`/tmp/datahub/ingestion/<exec id>/ingestion_log.txt`
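As a rough illustration of the truncation half of that fix (the helper name and per-line cap are assumptions, not the actual datahub-actions code), the idea is to persist the full line to the log file while sending the UI a capped version:

```python
MAX_UI_LINE_LEN = 4_000  # assumed per-line cap for the UI report, not the real value

def record_log_line(line: str, log_file) -> str:
    """Write the full line to the local log file, return a UI-safe version."""
    log_file.write(line + "\n")  # full line kept on disk (e.g. ingestion_log.txt)
    if len(line) > MAX_UI_LINE_LEN:
        return line[:MAX_UI_LINE_LEN] + " ...[truncated]"
    return line
```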
a
Thank you @gray-shoe-75895! So has datahub-actions now officially absorbed acryl-executor?
Oh, never mind. I saw you still use acryl-executor.
The ingestion tmp folder (`/tmp/datahub/ingestion/<exec id>`) is usually deleted after an ingestion run, regardless of whether it succeeds or not. Have you changed this as well? Otherwise, we wouldn't be able to see the full logs after the ingestion is done.
g
Yep acryl-executor is still separate for now - the intention is to merge it into actions, but we just haven’t gotten around to it
Ah I think the logs directory may get deleted, but I’ll check and confirm.
We’ll probably need to build some log retention/autodeletion capabilities if we persist them for a longer period of time, since some of the logs can get quite large
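A minimal sketch of what such retention could look like (the base path and age threshold are assumptions, not current datahub-actions behaviour):

```python
import shutil
import time
from pathlib import Path

# Hypothetical retention sketch: keep per-run log directories under
# /tmp/datahub/ingestion/ but delete any that are older than a configurable
# number of days, so persisted logs don't grow without bound.
def prune_old_ingestion_logs(base: str = "/tmp/datahub/ingestion", max_age_days: int = 7) -> None:
    cutoff = time.time() - max_age_days * 86_400
    for run_dir in Path(base).glob("*"):
        if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
            shutil.rmtree(run_dir, ignore_errors=True)
```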
a
Hello!
Would it be possible to bump the limit even further?
g
What version of datahub-actions do you have? I believe this was fixed in 0.0.8, and the latest is 0.0.9
a
We’re on 0.0.7 - I’ll upgrade! tysm @gray-shoe-75895!