# ingestion
m
Good morning, so I was trying to ingest metadata from Kafka using the following recipe:
```yaml
source:
    type: kafka
    config:
        platform_instance: <platform_instance>
        connection:
            consumer_config:
                security.protocol: SASL_PLAINTEXT
                sasl.username: <user>
                sasl.mechanism: PLAIN
                sasl.password: <password>
            bootstrap: 'broker1:9092'
            schema_registry_url: 'http://schema-registry:8081'
```
Then I got the following error:
```
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 98, in _read_output_lines
    line_bytes = await ingest_process.stdout.readline()
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
    raise ValueError(e.args[0])
ValueError: Separator is not found, and chunk exceed the limit
```
Note that this recipe worked in previous versions (the current version is v0.8.44). Thanks in advance!
g
Looks like one of the log lines was more than 64kb and overflowed our reporting buffer
I’ll bump up the log line length limit
m
But it doesn't affect the ingestion, does it?
g
If you run ingestion from the CLI, it should work fine. Our “executor” system starts a subprocess that runs `datahub ingest`, then captures the log lines and sends them back to the UI. In this case, the line is so long that the executor fails to read it and crashes, and hence the overall ingestion task fails as well
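For context, this line-by-line reading runs into asyncio's default 64 KB `StreamReader` limit. Below is a minimal, illustrative sketch (not the actual executor code; the command-line arguments and recipe path are assumptions) of how reading subprocess output one line at a time fails on oversized lines, and how the limit can be raised:

```python
import asyncio

# Illustrative sketch only: asyncio's StreamReader defaults to a 64 KiB limit,
# and readline() raises ValueError("Separator is not found, and chunk exceed
# the limit") when a single line exceeds it. Passing a larger `limit` when
# spawning the subprocess allows longer lines to be read.
async def run_ingest_with_larger_buffer(limit_bytes: int = 1024 * 1024):
    proc = await asyncio.create_subprocess_exec(
        "datahub", "ingest", "-c", "recipe.yml",  # hypothetical recipe path
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
        limit=limit_bytes,  # raises the per-line buffer from the 64 KiB default
    )
    while True:
        line = await proc.stdout.readline()  # fails here if a line exceeds the limit
        if not line:
            break
        print(line.decode(errors="replace"), end="")
    await proc.wait()

# asyncio.run(run_ingest_with_larger_buffer())
```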
a
I just ran into this problem as well. In addition to bumping up the buffer size, should the log capture code also break long log lines into chunks? That would make it a little more robust; otherwise, ingestion could break again if a log line exceeds the new limit you set.
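For illustration, the chunking idea suggested above could look roughly like this (a hypothetical helper, not part of datahub-actions):

```python
# Hypothetical sketch: split any log line longer than the buffer limit into
# smaller pieces before forwarding, so a single oversized line can never
# overflow the reader again.
def chunk_log_line(line: str, max_len: int = 64 * 1024) -> list[str]:
    """Break a long log line into pieces no longer than max_len characters."""
    if len(line) <= max_len:
        return [line]
    return [line[i:i + max_len] for i in range(0, len(line), max_len)]

# Example: a 200,000-character debug line becomes four chunks of at most 64 KiB.
chunks = chunk_log_line("x" * 200_000)
assert all(len(c) <= 64 * 1024 for c in chunks)
```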
@gray-shoe-75895 any update on this? I just ran into this issue again when I tried to debug a bigquery UI ingestion. I enabled the debug logs for the ingestion.
g
Yes, the fix for this will be included in the next datahub-actions release
a
awesome, when is the next release?
g
We just cut a new release for datahub-actions v0.0.8 - it’s still building the docker containers (https://github.com/acryldata/datahub-actions/actions), but please test it out once those are built and pushed and let me know how it goes
The fix was twofold: (1) increasing the buffer size from 64 KB to 1 MB, and (2) truncating long log lines when sending/displaying them in the UI. The truncation was necessary because GMS rejects payloads that are over 4 MB in size, but the full logs can still be found in the actions container at
`/tmp/datahub/ingestion/<exec id>/ingestion_log.txt`
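As a rough illustration of the truncation half of that fix (the helper name and per-line cap are assumptions, not the actual datahub-actions code), the idea is to persist the full line to the log file while sending the UI a capped version:

```python
MAX_UI_LINE_LEN = 4_000  # assumed per-line cap for the UI report, not the real value

def record_log_line(line: str, log_file) -> str:
    """Write the full line to the local log file, return a UI-safe version."""
    log_file.write(line + "\n")  # full line kept on disk (e.g. ingestion_log.txt)
    if len(line) > MAX_UI_LINE_LEN:
        return line[:MAX_UI_LINE_LEN] + " ...[truncated]"
    return line
```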
a
Thank you @gray-shoe-75895! So has datahub-actions now officially absorbed acryl-executor?
Oh, never mind. I saw you still use acryl-executor.
The ingestion tmp folder (`/tmp/datahub/ingestion/<exec id>`) is usually deleted after an ingestion run, regardless of whether it succeeds or not. Have you changed this as well? Otherwise, we wouldn't be able to see the full logs after the ingestion is done.
g
Yep acryl-executor is still separate for now - the intention is to merge it into actions, but we just haven’t gotten around to it
Ah I think the logs directory may get deleted, but I’ll check and confirm.
We’ll probably need to build some log retention/autodeletion capabilities if we persist them for a longer period of time, since some of the logs can get quite large
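A minimal sketch of what such retention could look like (the base path and age threshold are assumptions, not current datahub-actions behaviour):

```python
import shutil
import time
from pathlib import Path

# Hypothetical retention sketch: keep per-run log directories under
# /tmp/datahub/ingestion/ but delete any that are older than a configurable
# number of days, so persisted logs don't grow without bound.
def prune_old_ingestion_logs(base: str = "/tmp/datahub/ingestion", max_age_days: int = 7) -> None:
    cutoff = time.time() - max_age_days * 86_400
    for run_dir in Path(base).glob("*"):
        if run_dir.is_dir() and run_dir.stat().st_mtime < cutoff:
            shutil.rmtree(run_dir, ignore_errors=True)
```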
a
Hello!
Would it be possible to bump the limit even further?
g
What version of datahub-actions do you have? I believe this was fixed in 0.0.8, and the latest is 0.0.9
a
We’re on 0.0.7 - I’ll upgrade! tysm @gray-shoe-75895!