# ingestion
f
Hi all and happy Friday! We are doing some POC with DataHub, specifically ingesting table lineage from Redshift, and realized that it also automatically ingests upstream S3 COPY lineage, which is very interesting & useful. However, we have many event-level tables that get too many small batches of S3 files loaded via Kinesis Firehose, so the S3 lineage looks extremely verbose and even seems to completely break ingestion sometimes:
```
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 96, in _read_output_lines
    line_bytes = await ingest_process.stdout.readline()
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
    raise ValueError(e.args[0])
ValueError: Separator is found, but chunk is longer than limit
```
I see a related thread here: https://datahubspace.slack.com/archives/CUMUWQU66/p1663143783318239. Is there a way to exclude the upstream S3 lineage collection for certain Redshift tables? So far I have had to exclude the tables with extensive S3 upstreams entirely, otherwise ingestion doesn't work. I'm using `v0.8.44` and datahub actions `0.0.7`. Thanks!
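For reference, the exclusion workaround described above can be expressed as a `table_pattern` deny list in the Redshift source config. A minimal sketch using DataHub's Python `Pipeline` API; the connection details, table regex, and sink below are hypothetical placeholders, and the exact name matched by `table_pattern` can vary by version:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Sketch: skip the Firehose-fed event tables entirely so their huge
# S3 COPY upstreams never enter the ingestion run. All names and
# credentials here are hypothetical placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.example.redshift.amazonaws.com:5439",
                "database": "analytics",
                "username": "datahub",
                "password": "...",
                "table_pattern": {
                    # Regexes for the event-level tables to exclude.
                    "deny": ["public\\.firehose_events.*"],
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()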
d
I think you can disable COPY lineage by setting `include_copy_lineage` to false in your source config.
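If it helps, that change would look roughly like this; a sketch along the same lines as the recipe above, with placeholder connection details and sink:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Sketch: turn off S3 COPY lineage collection entirely, instead of
# denying individual tables. Placeholders throughout.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.example.redshift.amazonaws.com:5439",
                "database": "analytics",
                "username": "datahub",
                "password": "...",
                "include_copy_lineage": False,  # no s3 COPY lineage
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```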
g
On the `Separator is found, but chunk is longer than limit` error: that should be fixed in datahub actions 0.0.8.
f
Thanks all! Is there a way to specify a selective `include_copy_lineage`, now or sometime in the future?
g
Not at the moment, but we’d be happy to accept a contribution!
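For anyone picking this up: one plausible shape for such a contribution would be an allow/deny pattern that gates which tables get COPY lineage extracted. To be clear, `copy_lineage_pattern` below is hypothetical and does not exist in the source config today; this is only a sketch of what the option could look like:

```python
# Hypothetical config shape for selective COPY lineage.
# `copy_lineage_pattern` is NOT a real redshift source option as of
# v0.8.44; it is a sketch of what a contribution might add.
source_config = {
    "type": "redshift",
    "config": {
        "include_copy_lineage": True,
        "copy_lineage_pattern": {
            # Extract s3 COPY lineage for everything except the
            # Firehose-fed event tables.
            "allow": [".*"],
            "deny": ["public\\.firehose_events.*"],
        },
    },
}
```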