# ingestion
  • g

    great-branch-515

    09/17/2022, 5:06 AM
    @here Can anyone from the DataHub team help me with this PR https://github.com/datahub-project/datahub/pull/5968 related to issue https://github.com/datahub-project/datahub/issues/5872
    m
    • 2
    • 10
  • f

    full-chef-85630

    09/18/2022, 2:25 PM
    Hi all! I'm using ingestion transformers, specifically "Add Dataset datasetProperties". Have you considered having the get_properties_to_add() method take the PipelineContext as a parameter? It seems that would let me extend it however I need. @dazzling-judge-80093
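    For reference, a minimal sketch of the add_dataset_properties transformer wired to a custom resolver class, which is where get_properties_to_add() is invoked (the module and class names below are hypothetical placeholders):
    Copy code
    transformers:
        - type: "add_dataset_properties"
          config:
              add_properties_resolver_class: "my_company.datahub_transformers.MyPropertiesResolver"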
  • c

    clean-tomato-22549

    09/19/2022, 4:01 AM
    Hello. I am not sure whether this is the correct place to ask. I ingest my Looker with a base URL like https://company.looker.com:19999, while my Looker access URL is https://company.looker.com:9999. How can I make the Looker dashboard link in the UI work? Currently it links to https://company.looker.com without the 9999 port. I tried providing the external_base_url parameter, but it did not work.
    d
    • 2
    • 3
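    For reference, a hedged recipe sketch using the Looker source's documented base_url and external_base_url options (hostnames and credentials are placeholders):
    Copy code
    source:
        type: looker
        config:
            base_url: "https://company.looker.com:19999"            # API endpoint used for ingestion
            external_base_url: "https://company.looker.com:9999"    # base URL used when rendering links in the UI
            client_id: "${LOOKER_CLIENT_ID}"
            client_secret: "${LOOKER_CLIENT_SECRET}"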
  • m

    magnificent-lock-58916

    09/19/2022, 5:49 AM
    Hello! We're using Tableau ingestion and ran into some inconveniences:
    • After ingestion, renaming anything in Tableau leads to duplicates. For example, renaming "Chart 1" to "Chart 1 (test)" ends up as two charts in DataHub, named "Chart 1" and "Chart 1 (test)", while the expected behaviour is one chart named "Chart 1 (test)", same as in Tableau.
    • Renaming data sources ends up with the same problem as above.
    • If you delete a workbook in Tableau, it won't disappear in DataHub automatically.
    Is there a way to fix this? The renaming problem is the most concerning, since a lot of people contribute to our Tableau and it may end up creating a lot of duplicates with different names.
    m
    • 2
    • 12
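    If it helps, stale and renamed entities are generally handled by stateful ingestion with stale-metadata removal; a hedged sketch is below, noting that whether the Tableau source supports stateful ingestion depends on your DataHub version (connect_uri and pipeline_name are placeholders):
    Copy code
    pipeline_name: "tableau_prod"          # required so state can be tracked across runs
    source:
        type: tableau
        config:
            connect_uri: "https://tableau.company.com"
            stateful_ingestion:
                enabled: true
                remove_stale_metadata: true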
  • c

    creamy-pizza-80433

    09/19/2022, 7:10 AM
    Hello! I'm new to DataHub and I want to ask how to automate data lineage outside of Apache Airflow. In our case we're using Informatica to handle all of our ETL pipelines; is there a way to automate the lineage between the tasks? Thank you!
    h
    • 2
    • 4
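    One option, sketched below with hypothetical table names, is to generate a lineage file from the Informatica mapping definitions and ingest it with the file-based lineage source (type datahub-lineage-file):
    Copy code
    # lineage.yml, generated from Informatica mappings (entity names and platforms are placeholders)
    version: 1
    lineage:
        - entity:
              name: analytics.orders_enriched
              type: dataset
              env: PROD
              platform: oracle
          upstream:
              - entity:
                    name: staging.orders_raw
                    type: dataset
                    env: PROD
                    platform: oracle
    A matching recipe would then point at that file:
    Copy code
    source:
        type: datahub-lineage-file
        config:
            file: ./lineage.yml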
  • b

    better-actor-97450

    09/19/2022, 10:10 AM
    I'm ingesting from a Postgres source but can't ingest all the databases on the instance, only a single schema of one database. I thought that if I set database to null in the config, DataHub would pick up all databases on the Postgres instance. Where am I going wrong?
    h
    • 2
    • 1
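    For reference, a hedged recipe sketch for the postgres source; as far as I know each run connects to a single database (set via the database field), so leaving it null does not fan out to every database on the instance (host and credentials are placeholders):
    Copy code
    source:
        type: postgres
        config:
            host_port: "my-postgres:5432"
            database: "mydb"                 # one database per recipe run
            username: "${POSTGRES_USER}"
            password: "${POSTGRES_PASSWORD}"
            schema_pattern:
                allow:
                    - ".*"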
  • r

    rich-machine-24265

    09/19/2022, 10:34 AM
    Hi! I have a contribution proposal and would like to know if it would be useful. We're ingesting a lot of entities, more than 500k, using stateful ingestion, so the checkpoint state becomes huge, about 50 MB, and it is crucial to use compression. There is a warning that compression could not be turned on: https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/checkpoint.py#L42 . If I understood correctly, the reason is that the aspect has to be represented as a string, not a byte array. So, in our custom CheckpointState I compress the data with bz2, then encode it to a string with base85. The result is about 5 times smaller than the aspect's JSON representation. I'm wondering if this approach would be useful in the CheckpointStateBase#to_bytes method, i.e. doing something like
    base64.b85encode(bz2.compress(pickle.dumps(self)))
    instead of
    json_str_self.encode("utf-8")
    here https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/state/checkpoint.py#L49 ? Thank you!
    h
    • 2
    • 3
  • s

    square-winter-39825

    09/19/2022, 3:03 PM
    Hi, I am trying to configure spark lineage ingestion from databricks clusters. I added
    Copy code
    spark.jars.packages io.acryl:datahub-spark-lineage:0.8.44-3
    as an external jar to the cluster. I am looking to add the extra listener and server URL. Can somebody please share the steps if they have configured spark-lineage on Databricks clusters? Thanks!
    m
    • 2
    • 1
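    For reference, the Spark lineage properties are roughly the ones below, set in the cluster's Spark config alongside the package (the GMS URL is a placeholder):
    Copy code
    spark.jars.packages io.acryl:datahub-spark-lineage:0.8.44-3
    spark.extraListeners datahub.spark.DatahubSparkListener
    spark.datahub.rest.server http://<your-datahub-gms-host>:8080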
  • s

    swift-nail-32514

    09/19/2022, 4:43 PM
    Hi there, we're seeing that Databricks sources are being listed with the platform type Hive, and Synapse sources are being listed as MSSQL, because those are the plugins that need to be used for ingestion. Is there a way to force a differentiation that we're just not using? cc team members @brave-pencil-21289 @great-optician-81135
    h
    g
    • 3
    • 6
  • a

    agreeable-farmer-44067

    09/19/2022, 6:18 PM
    Hi all. I'm trying to connect my NiFi, but I haven't been able to. I'm using Docker: ingesting through a custom YAML recipe is not working, and neither is the CLI inside the Docker container. Could you help me with this? Thanks! :)
    h
    h
    d
    • 4
    • 6
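    For reference, a hedged recipe sketch for the nifi source (the URL, auth mode, and credentials are placeholders and will differ for your setup):
    Copy code
    source:
        type: nifi
        config:
            site_url: "https://my-nifi-host:8443/nifi/"
            auth: SINGLE_USER
            username: "${NIFI_USER}"
            password: "${NIFI_PASSWORD}"
    sink:
        type: datahub-rest
        config:
            server: "http://datahub-gms:8080"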
  • a

    alert-fall-82501

    09/20/2022, 9:22 AM
    Copy code
    Failed to create source due to Protocol message FieldDescriptorProto has no "proto3_optional" field.
    d
    • 2
    • 14
  • a

    alert-fall-82501

    09/20/2022, 9:22 AM
    Getting this error when trying to ingest metadata from BigQuery into DataHub.
  • a

    alert-fall-82501

    09/20/2022, 9:22 AM
    Can anybody help me with this? Thanks in advance.
  • c

    clean-tomato-22549

    09/20/2022, 9:46 AM
    Hi team, does the profiling.partition_datetime parameter work for Presto on Hive? According to the doc https://datahubproject.io/docs/generated/ingestion/sources/presto-on-hive it says "Only Bigquery supports this." Is there a plan to support this parameter for Presto on Hive?
    d
    • 2
    • 5
  • m

    microscopic-mechanic-13766

    09/20/2022, 12:10 PM
    Hello, so I have ingested from Hive a few times with profiling enabled. Everything has executed perfectly except for the duration of the process. One time it took 1569s (roughly 26 minutes), but the last time it took 3311s (almost an hour). I have been trying to increase the max_workers property, but that hasn't improved the times. I don't think it is a problem with the quantity of data (I only have 4 tables, and the maximum number of rows in them is 30) or with my deployment, as it didn't use to be this slow. Any tips on how to either speed up the ingestion or determine the actual cause of the problem? (Profiling of other sources is normal, so it isn't a problem with ingestion or profiling in general, but with Hive ingestion specifically.)
    h
    • 2
    • 3
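    If it helps, a hedged sketch of the profiling options that usually reduce runtime (exact option availability may vary by version; the host is a placeholder):
    Copy code
    source:
        type: hive
        config:
            host_port: "hive-server:10000"
            profiling:
                enabled: true
                max_workers: 10
                turn_off_expensive_profiling_metrics: true
                # cheapest option, table-level stats only:
                # profile_table_level_only: true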
  • b

    brave-pencil-21289

    09/20/2022, 12:27 PM
    Can we connect to Oracle using an Easy Connect string with cx_Oracle? Could someone share a sample recipe for this?
    m
    • 2
    • 1
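    For reference, a hedged sample recipe for the oracle source (host, service name, and credentials are placeholders); as far as I understand, the cx_Oracle connection string is built from host_port plus either service_name or database:
    Copy code
    source:
        type: oracle
        config:
            host_port: "oracle-db.company.com:1521"
            service_name: "ORCLPDB1"         # use exactly one of service_name or database (SID)
            username: "${ORACLE_USER}"
            password: "${ORACLE_PASSWORD}"
    sink:
        type: datahub-rest
        config:
            server: "http://datahub-gms:8080"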
  • a

    alert-fall-82501

    09/20/2022, 12:43 PM
    Hi team, I am ingesting metadata from Elasticsearch into DataHub and am getting the error below while ingesting:
    m
    • 2
    • 1
  • a

    alert-fall-82501

    09/20/2022, 12:43 PM
    Copy code
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -     raise NewConnectionError(
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x7f98aff4f940>: Failed to establish a new connection: [Errno -2] Name or service not known
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - [2022-09-20, 12:32:59 UTC] WARNING  {elasticsearch:293} - GET <https://vpc-prod-test-2-zzshies2gij47gfl4x5sehcebe.eu-central-1.es.amazonaws.com:443/_alias> [status:N/A request:0.803s]
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO - Traceback (most recent call last):
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -   File "/usr/local/lib/python3.8/site-packages/urllib3/connection.py", line 174, in _new_conn
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -     conn = connection.create_connection(
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -   File "/usr/local/lib/python3.8/site-packages/urllib3/util/connection.py", line 73, in create_connection
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -   File "/usr/local/lib/python3.8/socket.py", line 918, in getaddrinfo
    [2022-09-20, 12:32:59 UTC] {subprocess.py:89} INFO -     for res in _socket.getaddrinfo(host, port, family
  • c

    cool-vr-73109

    09/20/2022, 3:54 PM
    How can I enable the Queries tab for Snowflake ingestion?
    d
    m
    • 3
    • 4
  • c

    cool-vr-73109

    09/20/2022, 4:03 PM
    Tried to add the top_n_queries parameter to the snowflake source configuration but got an error like 'extra fields not permitted'.
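    If I'm reading the docs right, top_n_queries belongs to the usage source rather than the snowflake source, roughly as in the hedged sketch below (the account field name may differ by version; account and credentials are placeholders):
    Copy code
    source:
        type: snowflake-usage
        config:
            account_id: "my_account"      # may be host_port on older CLI versions
            username: "${SNOWFLAKE_USER}"
            password: "${SNOWFLAKE_PASSWORD}"
            top_n_queries: 10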
  • c

    cool-vr-73109

    09/20/2022, 4:04 PM
    For snowflake-usage-legacy I'm getting the error below.
  • s

    shy-lion-56425

    09/20/2022, 4:30 PM
    Hi all. I've recently set up DataHub on GCP via https://datahubproject.io/docs/deploy/gcp/ . I've been able to ingest BigQuery data, but haven't been able to get the S3 data lake source to work. Here's the example yaml:
    Copy code
    source:
        type: s3
        config:
            profiling:
                enabled: false
            path_specs:
                -
                    include: '<s3://MY_EXAMPLE_BUCKET/AWSLogs/0123456789/CloudTrail/us-east-1/2022/08/23/*.*>'
                -
                    enable_compresion: true
            aws_config:
                aws_access_key_id: '${AWS_ACCESS_KEY_ID_CLOUDTRAIL}'
                aws_region: us-east-1
                aws_secret_access_key: '${AWS_SECRET_ACCESS_KEY_CLOUDTRAIL}'
    However I get the following error:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '9e0f190f-05fd-407c-bdb9-16cebaed1d0c',
     'infos': ['2022-09-20 16:24:16.151405 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-09-20 16:24:18.213813 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: stdout=Elapsed seconds = 0\n'
               '  --report-to TEXT                Provide an output file to produce a\n'
               'This version of datahub supports report-to functionality\n'
               'datahub --debug ingest run -c /tmp/datahub/ingest/9e0f190f-05fd-407c-bdb9-16cebaed1d0c/recipe.yml --report-to '
               '/tmp/datahub/ingest/9e0f190f-05fd-407c-bdb9-16cebaed1d0c/ingestion_report.json\n'
               '[2022-09-20 16:24:17,736] INFO     {datahub.cli.ingest_cli:170} - DataHub CLI version: 0.8.43.2\n'
               '[2022-09-20 16:24:17,769] INFO     {datahub.ingestion.run.pipeline:163} - Sink configured successfully. DataHubRestEmitter: configured '
               'to talk to <http://datahub-datahub-gms:8080>\n'
               "[2022-09-20 16:24:17,770] ERROR    {datahub.ingestion.run.pipeline:127} - s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 85, in _ensure_not_lazy\n'
               '    plugin_class = import_path(path)\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 32, in import_path\n'
               '    item = importlib.import_module(module_name)\n'
               '  File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module\n'
               '    return _bootstrap._gcd_import(name[level:], package, level)\n'
               '  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import\n'
               '  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load\n'
               '  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked\n'
               '  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked\n'
               '  File "<frozen importlib._bootstrap_external>", line 850, in exec_module\n'
               '  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/s3/__init__.py", line 1, in <module>\n'
               '    from datahub.ingestion.source.s3.source import S3Source\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/source/s3/source.py", line 10, in <module>\n'
               '    import pydeequ\n'
               "ModuleNotFoundError: No module named 'pydeequ'\n"
               '\n'
               'The above exception was the direct cause of the following exception:\n'
               '\n'
               'Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 172, in __init__\n'
               '    source_class = source_registry.get(source_type)\n'
               '  File "/usr/local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 127, in get\n'
               '    raise ConfigurationError(\n'
               "datahub.configuration.common.ConfigurationError: s3 is disabled; try running: pip install 'acryl-datahub[s3]'\n"
               '[2022-09-20 16:24:17,773] INFO     {datahub.cli.ingest_cli:119} - Starting metadata ingestion\n'
               '[2022-09-20 16:24:17,774] INFO     {datahub.cli.ingest_cli:137} - Finished metadata ingestion\n'
               "[2022-09-20 16:24:17,919] ERROR    {datahub.entrypoints:188} - Command failed with 'Pipeline' object has no attribute 'source'. Run with "
               '--debug to get full trace\n'
               '[2022-09-20 16:24:17,920] INFO     {datahub.entrypoints:191} - DataHub CLI version: 0.8.43.2 at '
               '/usr/local/lib/python3.9/site-packages/datahub/__init__.py\n',
               "2022-09-20 16:24:18.214118 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Failed to execute 'datahub ingest'",
               '2022-09-20 16:24:18.214380 [exec_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c] INFO: Caught exception EXECUTING '
               'task_id=9e0f190f-05fd-407c-bdb9-16cebaed1d0c, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    self.event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.9/site-packages/nest_asyncio.py", line 89, in run_until_complete\n'
               '    return f.result()\n'
               '  File "/usr/local/lib/python3.9/asyncio/futures.py", line 201, in result\n'
               '    raise self._exception\n'
               '  File "/usr/local/lib/python3.9/asyncio/tasks.py", line 256, in __step\n'
               '    result = coro.send(None)\n'
               '  File "/usr/local/lib/python3.9/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 142, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"]}
    Execution finished with errors.
    h
    m
    • 3
    • 5
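    Two hedged observations: the executor environment appears to be missing the S3 extras (hence the "pip install 'acryl-datahub[s3]'" and missing pydeequ errors), and enable_compresion looks misspelled and placed as its own list item instead of inside the path spec. A corrected sketch of the recipe (bucket path copied from the message above):
    Copy code
    source:
        type: s3
        config:
            profiling:
                enabled: false
            path_specs:
                - include: 's3://MY_EXAMPLE_BUCKET/AWSLogs/0123456789/CloudTrail/us-east-1/2022/08/23/*.*'
                  enable_compression: true
            aws_config:
                aws_access_key_id: '${AWS_ACCESS_KEY_ID_CLOUDTRAIL}'
                aws_region: us-east-1
                aws_secret_access_key: '${AWS_SECRET_ACCESS_KEY_CLOUDTRAIL}'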
  • c

    cool-vr-73109

    09/20/2022, 4:12 PM
    IMG_20220920_213459.jpg
    h
    • 2
    • 1
  • c

    creamy-pizza-80433

    09/20/2022, 7:04 AM
    Hello, I tried to use transformers to add tags and glossary terms before the metadata hits the DataHub sink, but there seems to be a problem when I want to edit the description of the glossary terms: it says that the glossary term does not exist. Does that mean I have to create the glossary terms before referencing them in transformers? And how do I give my glossary term a custom name in the DataHub UI instead of this random string? Thank you!
    h
    • 2
    • 3
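    For reference, a hedged sketch of the simple_add_dataset_terms transformer; as far as I know the URNs should point at glossary terms that already exist (created in the UI or via a business glossary file), otherwise the UI only has the raw URN id to display (the term URN below is a hypothetical example):
    Copy code
    transformers:
        - type: "simple_add_dataset_terms"
          config:
              term_urns:
                  - "urn:li:glossaryTerm:Classification.PII"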
  • l

    lemon-cat-72045

    09/21/2022, 6:38 AM
    Hi, everyone. I have integrated Airflow pipelines with DataHub and can see the run history of DAGs within DataHub. I am wondering if we can set a retention time for DAG run history in DataHub so we don't store all of it indefinitely. Thanks!
    h
    • 2
    • 1
  • d

    dry-hair-98162

    09/21/2022, 7:28 AM
    Hi everyone, can anyone explain the difference between these three SSL connections? And how does each one work with MySQL ingestion? Thank you
    d
    • 2
    • 2
  • m

    magnificent-lock-58916

    09/21/2022, 10:24 AM
    Hello! I have an additional question about the Tableau integration and the fact that it doesn't have stateful ingestion. Let's say I have already imported a data source named "A" from Tableau into DataHub. After that, the metadata of this data source changed, for example we deleted some fields, but the data source is still named "A". With the next ingestion run, will DataHub update the already imported entity's metadata?
    m
    • 2
    • 2
  • f

    fancy-alligator-33404

    09/21/2022, 1:57 PM
    Hi, everyone! I have a question about ingestion from Hive. When I have views in Hive, the ingestion process succeeds, but I can't see those tables in DataHub. Could anyone help solve this problem? T^T
    h
    • 2
    • 3
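    A hedged guess: check whether view ingestion is enabled in the recipe; the SQL-based sources expose an include_views flag, roughly as below, though whether the Hive source actually emits views may depend on your version (the host is a placeholder):
    Copy code
    source:
        type: hive
        config:
            host_port: "hive-server:10000"
            include_tables: true
            include_views: true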
  • d

    delightful-barista-90363

    09/21/2022, 2:26 PM
    Hello, I haven't seen any updates regarding this PR. Just wondering if there are going to be any updates soon? My team would benefit from it pretty quickly.
  • s

    silly-finland-62382

    09/21/2022, 4:34 PM
    Hey, we are not able to see any tables as the upstream source in the Spark lineage when we run a Spark command. Can someone help with this? I checked in the backend and it is using the AppendData logical plan.
    h
    • 2
    • 2