bland-orange-13353
03/01/2024, 12:18 PM
miniature-mouse-35911
03/01/2024, 4:49 PM
source:
  type: athena
  config:
    aws_region: us-east-1
    work_group: primary
    s3_staging_dir: 's3://datahubpoc-data/athena-results/'
    catalog_name: datahubpoc-gluecatalog
    aws_role_arn: 'arn:aws:iam::<awsaccountid>:role/test-datahubec2-poc-role'
    profiling:
      enabled: 'True'
bulky-island-74277
03/04/2024, 3:10 AM
gifted-diamond-19544
03/04/2024, 8:58 AM
Athena ingestion. I was looking into the permissions, and it seems that Datahub needs permissions to run queries on Athena, as well as to get objects from S3. Are these permissions necessary if I just want to ingest metadata from Athena (meaning, no profiling)?
bland-application-65186
03/04/2024, 10:01 AM
s3://my-bucket/{dept}/tests/{table}/*.avro
# specifying keywords to be used in display name
What's the expected result of using {dept}?
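(For context, a minimal sketch of where that pattern would sit in an s3 source recipe; the bucket and region are placeholders. Per the comment in the docs example, a named keyword like {dept} is captured from the path and used in the dataset's display name, while {table} marks the table folder.)
source:
  type: s3
  config:
    path_specs:
      - include: 's3://my-bucket/{dept}/tests/{table}/*.avro'
    aws_config:
      aws_region: us-east-1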
purple-addition-48342
03/04/2024, 8:29 PM
boundless-bear-68728
03/05/2024, 1:41 AM
datahub-action service. Currently, I have assigned 6Gi with a max of up to 8Gi, but I can still see the service consuming around 7.6Gi of memory, and during this time the application UI becomes unresponsive. Is there any resolution to this issue? Currently, I am trying to ingest metadata for just 1 Snowflake DB with all advanced options turned on. Do I need to cut down on the number of schemas I am trying to ingest, or should I push the datahub-action service for more memory?
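(If narrowing the ingestion scope turns out to be the answer, a minimal sketch of limiting the Snowflake source to specific schemas via schema_pattern; the schema name is a placeholder and the connection details are elided.)
source:
  type: snowflake
  config:
    # ... connection details ...
    schema_pattern:
      allow:
        - '^MY_SCHEMA$'
    profiling:
      enabled: false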
elegant-salesmen-99143
03/05/2024, 8:07 AM
The env parameter is about to be deprecated. It said to use platform_instance instead. But it looks like platform_instance is for different use cases and works differently.
For example, I had a recipe that had env: STG. I tried replacing it with platform_instance: STG, but now when I look at the database structure, I have a container PROD at the upper level (PROD is the default value for env), and inside it I have an STG container with my database.
Is that the expected behavior?
Environment isn't the same thing as instance, so how do I specify the environment now?
After env is deprecated, what will happen to the databases that have PROD as the default value for env, not specified in the recipe? Will they behave differently from those where env: PROD is specified in the recipe?
I did this while on Datahub 12.1; I haven't upgraded to 13.0 yet, as I wanted to try replacing env first.
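(For reference, a minimal sketch assuming env and platform_instance remain independent settings: env sets the environment/fabric part of the URN, while platform_instance names a specific deployment of the platform and shows up as its own container. The source type and instance name below are placeholders.)
source:
  type: postgres
  config:
    env: STG                      # environment / fabric of the emitted URNs
    platform_instance: stg_pg_01  # a named instance of the platform
    # ... connection details ...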
able-jelly-63005
03/05/2024, 9:24 AM
few-accountant-12561
03/05/2024, 1:20 PM
few-piano-98292
03/05/2024, 7:24 PM
boundless-bear-68728
03/05/2024, 10:53 PM
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/dispatcher/default_dispatcher.py", line 30, in dispatch_async
    res = executor.execute(request)
  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/reporting_executor.py", line 94, in execute
    self._datahub_graph.emit_mcp(completion_mcp)
  File "/usr/local/lib/python3.10/site-packages/datahub/emitter/rest_emitter.py", line 245, in emit_mcp
    self._emit_generic(url, payload)
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'message': 'HTTPConnectionPool(host=\'datahub-datahub-gms\', port=8080): Max retries exceeded with url: /aspects?action=ingestProposal (Caused by ReadTimeoutError("HTTPConnectionPool(host=\'datahub-datahub-gms\', port=8080): Read timed out. (read timeout=30)"))'})
2024-03-05T22:44:20.989419034Z
Can you please help me with this issue?
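(One thing worth checking, as a sketch only: the "read timeout=30" in the error matches the default client read timeout, so raising timeout_sec on the datahub-rest sink may help, assuming GMS is just slow to respond rather than down. The server value is a placeholder.)
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
    timeout_sec: 120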
fresh-river-19527
03/06/2024, 12:23 PM
some-alligator-9844
03/06/2024, 2:43 PM
some-alligator-9844
03/06/2024, 2:46 PM
['env is deprecated and will be removed in a future release. Please use platform_instance instead.']
recipe.yaml
source:
  type: hive
  config:
    platform_instance: ANA.OCE.DEV
    env: DEV
    host_port: 'xxxxxxx.visa.com:10000'
    username: xxxxxxx
    options:
      connect_args:
        auth: KERBEROS
        kerberos_service_name: hive
sink:
  type: datahub-rest
  config:
    server: '${DATAHUB_GMS_HOST}'
    token: '${DATAHUB_GMS_TOKEN}'
    max_threads: 1
Datahub CLI version: 0.12.1.3
happy-branch-193
03/06/2024, 3:12 PM
incalculable-sundown-8765
03/06/2024, 7:18 PM
datahub delete. I want to hard delete everything related to redshift.
However, I encounter this issue:
% datahub delete --platform redshift --dry-run
[2024-03-06 20:13:35,266] INFO {datahub.cli.delete_cli:341} - Using DataHubGraph: configured to talk to http://localhost:8080
[2024-03-06 20:13:36,009] ERROR {datahub.entrypoints:201} - Command failed: ('Unable to get metadata from DataHub', {'message': '401 Client Error: Unauthorized for url: http://localhost:8080/api/graphql'})
Do I need a token to run the command? If so, how can I include the token in the command?
Thank you.
Datahub version: v0.12.1
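(A minimal sketch of one way to supply the token: the CLI reads ~/.datahubenv, which datahub init can generate; the server and token values below are placeholders. Setting the DATAHUB_GMS_TOKEN environment variable is an alternative.)
gms:
  server: 'http://localhost:8080'
  token: '<personal-access-token>'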
modern-orange-37660
03/06/2024, 9:31 PM
cuddly-dinner-641
03/07/2024, 4:02 PM
flat-bear-65100
03/08/2024, 2:10 AM
'container': ['urn:li:container:8e7ba34c02ebac26523e12b245223254',
'urn:li:container:8f14caa5a1220e7890ee5ca61d5c570d',
'urn:li:container:cee410e83a7898b2dda07dc3440c7cfd',
'urn:li:container:83e2422984342072527ec4f411c231e8',
'urn:li:container:4be6f93ced89cf3af76c4d5aa0a4313f',
'urn:li:container:fbf321045931666f19a792a7bcbd2d2e',
'urn:li:container:7702dc6c60dc4dbdd8ba26f3dc6464ad',
'urn:li:container:8da75ef4e929ee8bdc0dc8287d16cd2b',
'urn:li:container:1d0508f2f359898db300c54bd57ad670',
'urn:li:container:196bbcab079fa9315eb6badccfa8befb',
'... sampled of 21 total elements']},
'aspects': {'dataset': {'datasetProperties': 26, 'schemaMetadata': 26, 'operation': 26, 'container': 26, 'browsePathsV2': 52, 'status': 26},
'container': {'containerProperties': 21,
'status': 21,
'dataPlatformInstance': 21,
'subTypes': 21,
'browsePathsV2': 42,
'container': 20}},
'warnings': {},
'failures': {},
'soft_deleted_stale_entities': [],
'filtered': [],
'start_time': '2024-03-07 21:02:59.023267 (19.02 seconds ago)',
'running_time': '19.02 seconds'}
Sink (datahub-rest) report:
{'total_records_written': 328,
'records_written_per_second': 16,
'warnings': [],
'failures': [],
'start_time': '2024-03-07 21:02:58.256726 (19.79 seconds ago)',
'current_time': '2024-03-07 21:03:18.048629 (now)',
'total_duration_in_seconds': 19.79,
'max_threads': 15,
'gms_version': 'v0.13.0',
'pending_requests': 0}
flat-bear-65100
03/08/2024, 2:11 AM
quick-guitar-82682
03/08/2024, 5:17 AM
some-zoo-21364
03/08/2024, 10:26 AM
default_args = {
'owner': 'mygroup',
}
and the group YAML file contains:
id: mygroup
display_name: "My Group"
email: "mygroup@example.com"
however, triggering the DAG creates a new user with type CORP_USER and urn urn:li:corpuser:mygroup, instead of mapping it to the group entity with urn urn:li:corpGroup:mygroup.
gifted-coat-97302
03/08/2024, 12:21 PM
document_missing_exception. There seems to be data in the metadata_aspect_v2 database table, but nothing in Elasticsearch, and nothing is visible in the datahub-frontend either.
Datahub Details:
• Version: 0.12.1 (using docker images with this version)
• deployment type: Kubernetes (AWS EKS)
• deployment method: Custom internal Helm chart
◦ Frontend deployment separately
◦ GMS deployed with multiple replicas
▪︎ with MCE/MAE turned off
▪︎ metadata-auth enabled
▪︎ Hazelcast enabled (although we are having problems with this, so currently only running one replica)
◦ MAE consumer deployment separately with 2 replicas
◦ MCE consumer deployment separately with 1 replica
Further details in the thread; any help will be much appreciated.
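(Since metadata_aspect_v2 has data but Elasticsearch is empty, one commonly suggested remedy is re-running the restore-indices upgrade. A rough sketch below, assuming a Kubernetes Job around the acryldata/datahub-upgrade image; the image tag, configmap name, and env wiring are placeholders that would need to match how your GMS pods get their SQL/Kafka/Elasticsearch settings.)
apiVersion: batch/v1
kind: Job
metadata:
  name: datahub-restore-indices
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: datahub-upgrade
          image: acryldata/datahub-upgrade:v0.12.1
          args: ["-u", "RestoreIndices"]
          # assumption: reuse the same env as GMS (EBEAN_DATASOURCE_*, KAFKA_*, ELASTICSEARCH_*)
          envFrom:
            - configMapRef:
                name: datahub-gms-env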
miniature-magician-74764
03/08/2024, 7:49 PM
• Athena (top): urn:li:dataset:(urn:li:dataPlatform:athena,dq_cat_test.mod_cat1_test1,PROD)
• dbt (bottom): urn:li:dataset:(urn:li:dataPlatform:dbt,AwsDataCatalog.dq_cat_test.mod_cat1_test1,PROD)
Is there a way to add the correct Data Catalog into the Athena ingestion URN? Working with siblings would be impossible due to the volume and the data mesh schema we are developing.
athena_ingestion_nonprod.py
# The pipeline configuration is similar to the recipe YAML files provided to the CLI tool.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "athena",
            "config": {
                "aws_region": "us-east-2",
                "work_group": "primary",
                "query_result_location": "REDACTED",
                "catalog_name": "AwsDataCatalog",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "REDACTED",
                "token": "REDACTED",
            },
        },
    }
)

# Run the pipeline and report the results.
pipeline.run()
pipeline.pretty_print_summary()
recipe.dhub.dbt_nonprod.yaml
source:
  type: "dbt"
  config:
    # Coordinates
    # To use this as-is, set the environment variable DBT_PROJECT_ROOT to the root folder of your dbt project
    manifest_path: "REDACTED/manifest.json"
    catalog_path: "REDACTED/catalog.json"
    sources_path: "REDACTED/sources.json" # optional for freshness
    test_results_path: "REDACTED/run_results.json" # optional for recording dbt test results after running dbt test
    # Options
    target_platform: "athena" # e.g. bigquery/postgres/etc.
    # incremental_lineage: False # for when we want to remove the previous lineage
    entities_enabled: # Multiple dbt projects
      sources: "no"
sink:
  type: "datahub-rest"
  config:
    server: "REDACTED"
    token: "REDACTED"
boundless-bear-68728
03/08/2024, 10:13 PM
ripe-machine-72145
03/09/2024, 2:03 PM
worried-agent-2446
03/10/2024, 2:42 PM
clean-magazine-98135
03/11/2024, 2:42 AM
rich-barista-93413
03/11/2024, 9:24 AM