# ingestion
  • microscopic-mechanic-13766 (10/28/2022, 8:52 AM)
    Good Friday everyone. I have executed a Spark application, but the lineage it produced is quite strange, and I hope someone can help me understand what should have been done to obtain the right lineage. My intention is to build a new Hive table from a series of Hive and Postgres tables via a Spark application. The problem is that for some reason the Hive tables are shown as HDFS datasets, which stops me from seeing the full lineage of those Hive tables (see images). I don't know why they are shown like this when all the Spark application does is
    readHiveTable(<databaseName>, <tableName>)
    Thanks in advance !!
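    A minimal PySpark sketch of the distinction that may explain this, not taken from the thread: the Spark lineage agent records what the physical plan resolves to, so a managed Hive table can surface as its backing HDFS location. Table names, paths, and the DataHub server address below are made up; the listener settings are the ones documented for the io.acryl:datahub-spark-lineage agent, so verify them for your version.
    Copy code
    # Sketch only; assumes the datahub-spark-lineage jar is already on the cluster.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("lineage-demo")
        .enableHiveSupport()  # lets spark.table() resolve Hive metastore tables
        .config("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .config("spark.datahub.rest.server", "http://localhost:8080")
        .getOrCreate()
    )

    # Logical read through the metastore (hypothetical table name)...
    hive_df = spark.table("my_database.my_table")

    # ...versus the files that back the same table; lineage that shows an HDFS
    # dataset corresponds to this physical form (shown here only for contrast).
    file_df = spark.read.parquet("hdfs:///warehouse/my_database.db/my_table")

    hive_df.write.mode("overwrite").saveAsTable("my_database.my_new_table")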
  • high-hospital-85984 (10/28/2022, 9:51 AM)
    It looks like the v0.8.41 Docker image is missing dependencies needed to execute the Kafka Connect ingestion.
    Copy code
    docker run -it --user root -v /my/path/recipes:/temp linkedin/datahub-ingestion:v0.8.41 ingest run --dry-run -c /temp/kafka-connect-to-datahub-kafka.yml
    
    [2022-10-28 09:47:42,927] INFO     {datahub.cli.ingest_cli:99} - DataHub CLI version: 0.8.41+docker
    [2022-10-28 09:47:42,952] INFO     {datahub.ingestion.run.pipeline:160} - Sink configured successfully. 
    [2022-10-28 09:47:44,455] INFO     {datahub.ingestion.source.kafka_connect:866} - Connection to <address> is ok
    [2022-10-28 09:47:44,456] ERROR    {datahub.ingestion.run.pipeline:126} - No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.
    [2022-10-28 09:47:44,456] INFO     {datahub.cli.ingest_cli:115} - Starting metadata ingestion
    [2022-10-28 09:47:44,456] INFO     {datahub.cli.ingest_cli:133} - Finished metadata pipeline
    
    Failed to configure source (kafka-connect) due to No JVM shared library file (libjvm.so) found. Try setting up the JAVA_HOME environment variable properly.
    No ~/.datahubenv file found, generating one for you...
    Possibly related issue: https://github.com/datahub-project/datahub/issues/4741
  • prehistoric-fireman-61692 (10/28/2022, 10:35 AM)
    Cross-channel post (I think this might be a better channel). Hi all, I've set DataHub up via the Docker quickstart, but when trying to ingest metadata I get the following error message: "ERROR {datahub.entrypoints:165} - Cannot open config file". This happens when ingesting via either the UI or the CLI. Any clue what the problem is, or a fix?
  • jolly-football-89638 (10/28/2022, 12:05 PM)
    Hey, random question. I am trying to profile some rather large tables/views in Snowflake. Running with 5 workers against 2 X-Large warehouses, it looks like it is going to take over 24 hours to complete (I have never let it run that far). I should also note that I edited sampler.py to hardcode a maximum of 10k rows per table. Has anyone tried something like this on Snowflake? Any suggestions on how to speed it up? Would more workers against the same warehouse help? I really don't want to spend over 700 credits just to profile one of my schemas; that is not a cost I can incur on a regular basis.
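    A sketch of a trimmed-down profiling setup, run programmatically the same way the Airflow example further down this page does (Pipeline.create). The option names are from the generic GE-based profiling config of that CLI era and the credentials are placeholders; verify both against the snowflake source docs for your version.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "snowflake",
                "config": {
                    "account_id": "my_account",
                    "username": "user",
                    "password": "pass",
                    "warehouse": "PROFILING_WH",
                    # Only profile the tables you actually care about.
                    "profile_pattern": {"allow": ["MY_DB\\.MY_SCHEMA\\..*"]},
                    "profiling": {
                        "enabled": True,
                        # Skip the most expensive per-column metrics
                        # (sample values, distinct-value stats, etc.).
                        "turn_off_expensive_profiling_metrics": True,
                        "max_workers": 10,
                        "query_combiner_enabled": True,
                    },
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()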
  • full-chef-85630 (10/28/2022, 1:23 PM)
    Hi all, we use Airflow in cluster mode (Celery executor) to perform the ingestion. The data source is BigQuery, and some exceptions occur. @dazzling-judge-80093
    Copy code
    """dag name: social-insights"""
    from datetime import timedelta, datetime
    from airflow import DAG
    from airflow.utils.dates import days_ago
    from datahub.configuration.config_loader import load_config_file
    from datahub.ingestion.run.pipeline import Pipeline
    
    try:
        from airflow.operators.python import PythonOperator
    except ModuleNotFoundError:
        from airflow.operators.python_operator import PythonOperator
    
    default_args = {
        "owner": "airflow",
        "depends_on_past": False,
        "email": "<http://xxxx.com|xxxx.com>",
        "email_on_failure": False,
        "email_on_retry": False,
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
        "execution_timeout": timedelta(minutes=120),
    }
    
    
    def template():
        """Run ingestion job: social-insights"""
        config = load_config_file("/social_insights.yaml")
        pipeline = Pipeline.create(config)
        pipeline.run()
        pipeline.raise_from_status()
    
    
    with DAG(
        dag_id="social-insights",
        schedule_interval=timedelta(hours=1),
        start_date=days_ago(2),
    ) as dag:
        PythonOperator(
            task_id="social-insights",
            python_callable=template,
        )
    Airflow error info:
    Copy code
    . It will be skipped from lineage. The error was daemonic processes are not allowed to have children.
    On a single-node Airflow, the same ingestion runs normally.
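    One common workaround for this symptom (an assumption, not something confirmed in this thread): "daemonic processes are not allowed to have children" comes from Python multiprocessing running inside a Celery prefork worker, so instead of calling Pipeline.run() in-process, shell out to the datahub CLI and let it spawn its own processes. A sketch mirroring the DAG above:
    Copy code
    from datetime import timedelta

    from airflow import DAG
    from airflow.utils.dates import days_ago

    try:
        from airflow.operators.bash import BashOperator
    except ModuleNotFoundError:
        from airflow.operators.bash_operator import BashOperator

    with DAG(
        dag_id="social-insights-cli",
        schedule_interval=timedelta(hours=1),
        start_date=days_ago(2),
    ) as dag:
        BashOperator(
            task_id="social-insights-cli",
            # Runs the recipe in a fresh, non-daemonic process.
            bash_command="datahub ingest -c /social_insights.yaml",
        )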
  • little-spring-72943 (10/28/2022, 4:02 PM)
    Since upgrading to 0.9.0, assertionRunEvent events are not being emitted using the emit_mcp or emit Python functions. No errors are returned, and all other events emit fine. Does anyone else see the same behaviour?
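    For reference, a minimal sketch of a bare-bones assertionRunEvent emit with the Python SDK, useful for isolating whether the problem is in the emitter call itself. The URNs and run id are made up, and the exact required field set of AssertionRunEvent should be checked against the SDK version in use.
    Copy code
    import time

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AssertionResultClass,
        AssertionResultTypeClass,
        AssertionRunEventClass,
        AssertionRunStatusClass,
        ChangeTypeClass,
    )

    assertion_urn = "urn:li:assertion:my-assertion-id"
    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"

    run_event = AssertionRunEventClass(
        timestampMillis=int(time.time() * 1000),
        runId="manual-test-run",
        assertionUrn=assertion_urn,
        asserteeUrn=dataset_urn,
        status=AssertionRunStatusClass.COMPLETE,
        result=AssertionResultClass(type=AssertionResultTypeClass.SUCCESS),
    )

    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="assertion",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=assertion_urn,
            aspectName="assertionRunEvent",
            aspect=run_event,
        )
    )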
  • crooked-holiday-47153 (10/31/2022, 8:27 AM)
    Hi all, question: we are using DataHub to ingest data from Snowflake and Tableau. The tables we ingest from Snowflake are a copy of the prod tables, which reside in MySQL (we have an ETL set up to sync the MySQL tables into Snowflake). In Snowflake the tables reside under schema X, while in MySQL the schema has a different name, let's call it schema Y, but the table names are the same in both places. Is there a way, using the MySQL ingestion, to combine the metadata coming from MySQL into the existing tables (e.g. existing URNs) we already ingested from Snowflake? Thanks in advance, Eyal
  • silly-intern-25190 (10/31/2022, 8:35 AM)
    Hi all, I am working on the Vertica DataHub plugin and I want to ingest the creation time for each table and view. Where would be a suitable place for it? Can we put it in stats?
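    One low-risk place such a timestamp could live (an assumption, not guidance from this thread) is the datasetProperties aspect's customProperties map, which the plugin can emit per table. A sketch with a made-up URN and value:
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:vertica,db.schema.table,PROD)"

    # Store the creation time as a custom property on the dataset.
    props = DatasetPropertiesClass(customProperties={"create_time": "2022-10-31T08:35:00Z"})

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="datasetProperties",
            aspect=props,
        )
    )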
  • gifted-knife-16120 (10/31/2022, 10:35 AM)
    Hi all, I tried to ingest data from Metabase, but got this error:
    Copy code
    ['Platform was not found in DataHub. Using postgres name as is'],
    'metabase-dbname-2': ['Cannot determine database name for platform: postgres'],
    'metabase-platform-3': ['Platform was not found in DataHub. Using postgres name as is'],
    'metabase-dbname-3': ['Cannot determine database name for platform: postgres'],
    'metabase-platform-1': ['Platform was not found in DataHub. Using h2 name as is'],
    'metabase-dbname-1': ['Cannot determine database name for platform: h2'],
    Can anyone help?
  • careful-action-61962 (10/31/2022, 10:58 AM)
    Hey guys, I'm trying UI-based ingestion but it is failing. Earlier it was running fine; suddenly I started getting this:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '2aa01dec-ff3a-4093-af01-9538d1fae92c',
     'infos': ['2022-10-30 18:30:00.193001 [exec_id=2aa01dec-ff3a-4093-af01-9538d1fae92c] INFO: Starting execution for task with name=RUN_INGEST',
               '2022-10-30 18:30:00.193303 [exec_id=2aa01dec-ff3a-4093-af01-9538d1fae92c] INFO: Caught exception EXECUTING '
               'task_id=2aa01dec-ff3a-4093-af01-9538d1fae92c, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 113, in execute_task\n'
               '    task_event_loop = asyncio.new_event_loop()\n'
               '  File "/usr/local/lib/python3.10/asyncio/events.py", line 782, in new_event_loop\n'
               '    return get_event_loop_policy().new_event_loop()\n'
               '  File "/usr/local/lib/python3.10/asyncio/events.py", line 673, in new_event_loop\n'
               '    return self._loop_factory()\n'
               '  File "/usr/local/lib/python3.10/asyncio/unix_events.py", line 64, in __init__\n'
               '    super().__init__(selector)\n'
               '  File "/usr/local/lib/python3.10/asyncio/selector_events.py", line 53, in __init__\n'
               '    selector = selectors.DefaultSelector()\n'
               '  File "/usr/local/lib/python3.10/selectors.py", line 350, in __init__\n'
               '    self._selector = self._selector_cls()\n'
               'OSError: [Errno 24] Too many open files\n']}
    Execution finished with errors.
  • alert-fall-82501 (10/31/2022, 1:05 PM)
    Hi team, I'm working on the DataHub Actions framework. I am getting the error below after running the action framework.
  • alert-fall-82501 (10/31/2022, 1:06 PM)
    Copy code
    %4|1667221196.244|FAIL|rdkafka#consumer-1| [thrd:datahub-sbx2-frontend.amer-dev.XXXX.com:9092/bootstrap]: datahub-sbx2-frontend.amer-dev.XXXX.com:9092/bootstrap: Connection setup timed out in state CONNECT (after 30090ms in state CONNECT)
  • alert-fall-82501 (10/31/2022, 1:06 PM)
    Can anybody advise on this?
  • bland-nail-65199 (10/31/2022, 2:10 PM)
    Hi team, I'm testing the CSV enricher for metadata using a YAML recipe and a CSV file, and received the following error: KeyError: 'Did not find a registered class for csv-enricher'. I didn't do the deployment myself, and the DataHub instance the team deployed has CLI version 0.8.28.1. Could this error be because we're using an older version of DataHub, or is it because of our configuration? Many thanks for the help!
  • astonishing-pager-27015 (10/31/2022, 3:37 PM)
    Is it true that the new bigquery connector only supports table-level profiling? The sample recipe in the docs has this:
    Copy code
    profiling:
      enabled: true
      profile_table_level_only: true
    but setting up ingestion in the UI with "Enable Profiling" checked, I only see this in the YAML view:
    Copy code
    profiling:
      enabled: true
    When I run it, no profiling seems to occur, though everything else works fine. Edit: I needed to change the profiling settings that limit profiling based on table change recency, row count, and size.
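    For reference, a sketch of the settings referred to in that edit, i.e. the options that gate profiling by recency, size, and row count in the BigQuery source. The option names and units below are recalled from the source docs of that era and should be verified for your CLI version; this dict is the profiling block of a recipe's source config.
    Copy code
    profiling_config = {
        "enabled": True,
        "profile_table_level_only": False,
        # Profile tables even if they have not changed recently.
        "profile_if_updated_since_days": 3650,
        # Raise the size/row cutoffs that silently skip big tables.
        "profile_table_size_limit": 50,        # assumed to be in GB
        "profile_table_row_limit": 500000000,
    }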
  • purple-sugar-36357 (10/31/2022, 4:55 PM)
    Hello everyone, when we are ingesting data from S3, how can we add a link back to the asset we just ingested? The Tableau connector does something like this, tracing the lineage back to the Tableau report or chart. For S3, is there a way to link back to the S3 location so that it can be opened directly? If we click "Add Link" in the Documentation section of the UI and paste in the link to the actual location of the file, then pulling up that asset and clicking the link takes us to the file's location, but is there a way to do that automatically via ingestion for S3? Thank you.
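    A sketch of attaching such a link programmatically after ingestion, via the institutionalMemory aspect (this is not a built-in S3-source feature I can confirm; the URN and console URL are made up). Note that upserting this aspect replaces any links already on the dataset.
    Copy code
    import time

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeTypeClass,
        InstitutionalMemoryClass,
        InstitutionalMemoryMetadataClass,
    )

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:s3,my-bucket/path/file.parquet,PROD)"

    link = InstitutionalMemoryMetadataClass(
        url="https://s3.console.aws.amazon.com/s3/buckets/my-bucket?prefix=path/",
        description="Open in S3 console",
        createStamp=AuditStampClass(
            time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion"
        ),
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="institutionalMemory",
            aspect=InstitutionalMemoryClass(elements=[link]),
        )
    )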
  • early-hydrogen-27542 (10/31/2022, 5:51 PM)
    Hi everyone! We are trying to determine how DataHub determines field-level usage. Where is the code that defines `DatasetFieldUsageCountsClass`? I can see it's imported here, but I haven't found the code behind that class. In addition to this, are there any other key places in the source code that would clue us in to how the field-level usage is determined and/or where it ends up?
  • limited-forest-73733 (10/31/2022, 7:33 PM)
    Hey team, does the 0.9.1 DataHub release have the fix for SQLAlchemy? I mean, is it compatible with Airflow 2.3.x? Any suggestions? Thanks in advance.
  • witty-microphone-40893 (10/31/2022, 10:51 PM)
    Hi, I've just run an ingestion against a production DB and have realised that the 'sample' data collected by DataHub includes PII data. How do I prevent sampling of, or mask, PII fields?
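    A sketch of the profiling switch that controls sample values (the option name is from the GE-based profiling config and should be verified for your source and version; sensitive tables can also be excluded entirely with the source's profile_pattern deny list). This dict is the profiling block of a recipe's source config.
    Copy code
    profiling_config = {
        "enabled": True,
        # Stop collecting example values for columns entirely.
        "include_field_sample_values": False,
    }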
  • damp-ambulance-34232 (11/01/2022, 4:19 AM)
    Hi, how do I add a Glossary Term Group and a Glossary Term from the UI or from the Ingestion button?
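    For the ingestion route, a sketch of loading a glossary from a file with the datahub-business-glossary source; term groups are "nodes" and terms are "terms". The file layout shown is my recollection of the documented format and all names are made up, so double-check the business glossary source docs.
    Copy code
    from pathlib import Path

    from datahub.ingestion.run.pipeline import Pipeline

    glossary_yaml = """\
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Classification          # a Term Group
        description: Data classification levels
        terms:
          - name: Confidential        # a Term inside the group
            description: Restricted to internal use
    """
    Path("/tmp/business_glossary.yml").write_text(glossary_yaml)

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {"file": "/tmp/business_glossary.yml"},
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()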
  • alert-fall-82501 (11/01/2022, 10:28 AM)
    Hi team, just a quick question. I am adding descriptions to table fields in DataHub. Several tables have the same field, and I don't want to manually enter the same description for each of the other tables. Can we do this through the CSV source, or is there some other way to do it? Please advise.
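    A sketch of scripting the same field description onto many tables with the Python emitter, offered as an alternative to the CSV enricher rather than a confirmed recommendation from this thread. Table list, field name, and text are made up. Note that this upserts the whole editableSchemaMetadata aspect, so it will replace other UI-added field descriptions on those tables.
    Copy code
    import time

    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeTypeClass,
        EditableSchemaFieldInfoClass,
        EditableSchemaMetadataClass,
    )

    emitter = DatahubRestEmitter("http://localhost:8080")
    description = "Unique customer identifier, assigned at signup."
    now = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:ingestion")

    for table in ["db.schema.orders", "db.schema.payments", "db.schema.refunds"]:
        aspect = EditableSchemaMetadataClass(
            editableSchemaFieldInfo=[
                EditableSchemaFieldInfoClass(fieldPath="customer_id", description=description)
            ],
            created=now,
        )
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=make_dataset_urn("snowflake", table, "PROD"),
                aspectName="editableSchemaMetadata",
                aspect=aspect,
            )
        )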
  • bitter-elephant-29459 (11/01/2022, 2:27 PM)
    Hello, I'm pretty new to DataHub. Could anyone please tell me whether the DataHub Redshift integration fetches the comments on tables and columns? https://docs.aws.amazon.com/redshift/latest/dg/r_COMMENT.html
  • silly-finland-62382 (11/01/2022, 6:36 PM)
    Hey, can someone tell me the command to delete a subfolder under the Spark platform? I have the Spark lineage platform in DataHub inside the PROD env, and I want to delete a subfolder (like "box") inside Spark.
  • eager-lifeguard-22029 (11/01/2022, 7:36 PM)
    Is it possible to upload a schema to DataHub via a CSV file (or similar) that represents table columns? Basically, not connecting an actual data store but uploading the schema metadata manually (preferably through the UI).
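    A sketch of doing this from a CSV with the Python SDK rather than the UI (I cannot confirm a UI feature for it): build a schemaMetadata aspect from the rows and emit it. The assumed CSV layout (field_path,native_type,description), file path, platform, and dataset name are all made up.
    Copy code
    import csv

    from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    fields = []
    with open("my_table_columns.csv") as f:
        for row in csv.DictReader(f):
            fields.append(
                SchemaFieldClass(
                    fieldPath=row["field_path"],
                    # Simplification: model every column as a string type.
                    type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                    nativeDataType=row["native_type"],
                    description=row.get("description", ""),
                )
            )

    schema = SchemaMetadataClass(
        schemaName="my_table",
        platform=make_data_platform_urn("hive"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=fields,
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn("hive", "my_db.my_table", "PROD"),
            aspectName="schemaMetadata",
            aspect=schema,
        )
    )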
  • witty-television-74309 (11/01/2022, 8:14 PM)
    This could be a very basic question. I am ingesting Snowflake tables, and profiling is enabled in the recipe, but I am not able to see row-level statistics; even the Stats tab is disabled in the UI. Do you know which steps I am missing? I can see the table schema.
  • mammoth-fountain-69052 (11/02/2022, 4:55 AM)
    Hi, I am not able to connect to an on-premise ClickHouse server. Below is the execution summary. Can anyone please point out the steps I am missing?
  • limited-forest-73733 (11/02/2022, 9:34 AM)
    Hey team, I am not able to enable profiling. This is the recipe we are using:
  • limited-forest-73733 (11/02/2022, 10:57 AM)
    Hey, I am not able to ingest Snowflake tables into DataHub with version 0.8.45.2.
  • flaky-soccer-57765 (11/02/2022, 12:38 PM)
    Hi all, I have an MSSQL Server ingestion source and I need to ingest metadata from multiple databases within that server. Do I need an individual recipe file for each of these databases, or can I have this all in a single recipe? Thanks.
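    One way to keep a single definition while still covering several databases (a sketch, not a statement about what the mssql source itself supports; check its docs for your version): loop over the database names and run one pipeline per database. Host, credentials, and database names below are placeholders.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    for database in ["sales_db", "finance_db", "hr_db"]:
        pipeline = Pipeline.create(
            {
                "source": {
                    "type": "mssql",
                    "config": {
                        "host_port": "mssql.mycompany.local:1433",
                        "username": "datahub",
                        "password": "***",
                        "database": database,
                    },
                },
                "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
            }
        )
        pipeline.run()
        pipeline.raise_from_status()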
  • thankful-ram-70854 (11/02/2022, 4:08 PM)
    Hi all, I'm trying to ingest metadata from Tableau using the built-in Tableau source, but I'm getting "0 assets ingested". My Tableau project structure is nested, e.g.:
    • Marketing
      ◦ Data Sources
      ◦ Dashboards
    • Sales
      ◦ Data Sources
      ◦ Dashboards
    Does the Tableau source support nested projects, or is there a way to specify the leaf projects in the config? Thanks.
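    For reference, a sketch of a Tableau recipe that names projects explicitly. Connection details are placeholders; the projects option takes project names in the source docs of that era, but how it treats nested projects has changed across versions, so verify against the tableau source reference for yours.
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "tableau",
                "config": {
                    "connect_uri": "https://tableau.mycompany.com",
                    "site": "my_site",
                    "token_name": "datahub",
                    "token_value": "***",
                    "projects": ["Marketing", "Sales"],
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()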