# ingestion
f
Hi all and happy Friday! We are doing some POC with DataHub, specifically ingesting table lineage from Redshift, and realized that it also automatically ingests upstream S3 COPY lineage, which is very interesting & useful. However, we have many event-level tables that get too many small batches of S3 files loaded via Kinesis Firehose, so the S3 lineage looks extremely verbose and even seems to completely break ingestion sometimes:
```
  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 96, in _read_output_lines
    line_bytes = await ingest_process.stdout.readline()
  File "/usr/local/lib/python3.9/asyncio/streams.py", line 549, in readline
    raise ValueError(e.args[0])
ValueError: Separator is found, but chunk is longer than limit
```
I see a related thread here: https://datahubspace.slack.com/archives/CUMUWQU66/p1663143783318239. Is there a way to exclude the upstream S3 lineage collection for certain Redshift tables? So far I have had to exclude the tables with extensive S3 upstreams entirely, otherwise ingestion doesn't work. I'm using `v0.8.44` and datahub actions `0.0.7`. Thanks!
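For reference, the exclusion workaround described above can be expressed as a `table_pattern` deny list in the Redshift source config. A minimal sketch using DataHub's Python `Pipeline` API; the connection details, table regex, and sink below are hypothetical placeholders, and the exact name matched by `table_pattern` can vary by version:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Sketch: skip the Firehose-fed event tables entirely so their huge
# S3 COPY upstreams never enter the ingestion run. All names and
# credentials here are hypothetical placeholders.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.example.redshift.amazonaws.com:5439",
                "database": "analytics",
                "username": "datahub",
                "password": "...",
                "table_pattern": {
                    # Regexes for the event-level tables to exclude.
                    "deny": ["public\\.firehose_events.*"],
                },
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()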
d
I think you can disable COPY lineage by setting `include_copy_lineage` to false in your source config.
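If it helps, that change would look roughly like this; a sketch along the same lines as the recipe above, with placeholder connection details and sink:

```python
from datahub.ingestion.run.pipeline import Pipeline

# Sketch: turn off S3 COPY lineage collection entirely, instead of
# denying individual tables. Placeholders throughout.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "redshift",
            "config": {
                "host_port": "my-cluster.example.redshift.amazonaws.com:5439",
                "database": "analytics",
                "username": "datahub",
                "password": "...",
                "include_copy_lineage": False,  # no s3 COPY lineage
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```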
g
On the `Separator is found, but chunk is longer than limit` error: that should be fixed in datahub actions 0.0.8.
f
Thanks all! Is there a way to specify a selective `include_copy_lineage`, now or sometime in the future?
g
Not at the moment, but we’d be happy to accept a contribution!
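For anyone picking this up: one plausible shape for such a contribution would be an allow/deny pattern that gates which tables get COPY lineage extracted. To be clear, `copy_lineage_pattern` below is hypothetical and does not exist in the source config today; this is only a sketch of what the option could look like:

```python
# Hypothetical config shape for selective COPY lineage.
# `copy_lineage_pattern` is NOT a real redshift source option as of
# v0.8.44; it is a sketch of what a contribution might add.
source_config = {
    "type": "redshift",
    "config": {
        "include_copy_lineage": True,
        "copy_lineage_pattern": {
            # Extract s3 COPY lineage for everything except the
            # Firehose-fed event tables.
            "allow": [".*"],
            "deny": ["public\\.firehose_events.*"],
        },
    },
}
```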