# ingestion
  • a

    abundant-airport-72599

    03/08/2023, 10:37 PM
hey all, I’m brainstorming on SQL lineage extraction and wanted to see if others have been down a similar path. I’m currently working out a way to populate Trino lineage in our DataHub instance, and I have a log that includes all of the SQL statements executed against our Trino cluster. To start, I’m thinking of filtering down to INSERT and CTAS statements and then churning through them with something like https://github.com/reata/sqllineage to figure out which tables data is coming from and which it’s going to (a rough sketch of that approach is below). I’ve played around with it a little and it seems to work, but has anybody else here tried this package or something similar (against any kind of SQL audit log)? If so, how has it worked out? I’m worried about potential pitfalls.
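A minimal sketch of the approach above, using sqllineage's LineageRunner to pull source and target tables out of each logged INSERT/CTAS statement (the example statements are hypothetical stand-ins for entries read from the Trino audit log):
Copy code
from sqllineage.runner import LineageRunner

# Hypothetical example statements; in practice these would be read from the Trino audit log.
logged_statements = [
    "INSERT INTO analytics.daily_orders SELECT * FROM raw.orders",
    "CREATE TABLE analytics.order_summary AS SELECT customer_id, count(*) AS n FROM analytics.daily_orders GROUP BY customer_id",
]

for sql in logged_statements:
    # Only INSERT / CTAS statements carry the table-to-table lineage of interest here.
    if not sql.lstrip().upper().startswith(("INSERT", "CREATE TABLE")):
        continue
    runner = LineageRunner(sql)
    sources = [str(t) for t in runner.source_tables()]
    targets = [str(t) for t in runner.target_tables()]
    print(f"{sources} -> {targets}")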
  • f

    full-football-39857

    03/09/2023, 4:21 AM
Hi everybody, we have installed DataHub and tested ingestion for two databases that DataHub supports (MySQL, MongoDB). However, for sources DataHub does not support we need to ingest manually, e.g. ODI (Oracle Data Integrator), OGG, JupyterHub, OAS... Could you please share any documentation or guidelines? Thanks.
  • c

    curved-truck-53235

    03/09/2023, 5:10 PM
Hi everyone! Is there a way to configure allow/deny patterns with environment variables? (see the sketch below)
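Recipe YAML files generally support ${VAR} substitution for values; if that isn't enough, a minimal sketch of building the allow/deny patterns from environment variables in Python and running the ingestion Pipeline directly (the SCHEMA_ALLOW/SCHEMA_DENY variable names and the mysql source are placeholders for illustration):
Copy code
import os

from datahub.ingestion.run.pipeline import Pipeline

# Hypothetical environment variables holding comma-separated regex patterns.
allow_patterns = os.environ.get("SCHEMA_ALLOW", ".*").split(",")
deny_patterns = [p for p in os.environ.get("SCHEMA_DENY", "").split(",") if p]

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",  # placeholder; any source whose config accepts schema_pattern
            "config": {
                "host_port": os.environ.get("MYSQL_HOST_PORT", "localhost:3306"),
                "username": os.environ.get("MYSQL_USER", "datahub"),
                "password": os.environ.get("MYSQL_PASSWORD", ""),
                "schema_pattern": {"allow": allow_patterns, "deny": deny_patterns},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": os.environ.get("DATAHUB_GMS", "http://localhost:8080")},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()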
  • r

    red-waitress-53338

    03/09/2023, 6:15 PM
    Hi Team, I know with BigQuery profiling we can specify the schema name for the creation of temp tables using
    bigquery_temp_table_schema
, but this schema has to exist in the current project from which we are ingesting the tables. Is there a way to specify a schema from a different project? Rather than creating temp tables in all the source projects, can we reroute the temp tables to another designated common project?
  • i

    important-beach-76585

    03/10/2023, 8:20 AM
Hi everyone! How do I ingest Hive queries?
  • i

    important-helmet-98156

    03/10/2023, 8:42 AM
Hello everybody, maybe a dumb question, but I am wrapping my head around ingesting all metadata from a GraphQL API endpoint. 🤔 The scenario: • We have a product information store, https://pimcore.com/docs/data-hub/current/GraphQL/index.html, that exposes a GraphQL endpoint. • We would like to have all information that Pimcore holds as metadata in our DataHub. • Currently I would create an OpenAPI doc from the GraphQL endpoint and then ingest that OpenAPI spec, but isn’t there a more elegant way of doing this? Best regards 🙂
  • w

    wonderful-jordan-36532

    03/10/2023, 8:57 AM
It seems ingestion for machine learning models is deprecated. Are there any alternatives for ingesting ML models? (see the sketch below) Machine Learning Models | DataHub
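A hedged sketch of pushing an ML model entity from Python via the emitter instead of a dedicated source, assuming the mlModel URN format and the MLModelProperties aspect (the model name, platform and GMS address are placeholders):
Copy code
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import MLModelPropertiesClass

# Placeholder mlModel URN: (platform, model name, environment).
model_urn = "urn:li:mlModel:(urn:li:dataPlatform:mlflow,churn_classifier,PROD)"

mcp = MetadataChangeProposalWrapper(
    entityUrn=model_urn,
    aspect=MLModelPropertiesClass(
        description="Gradient-boosted churn classifier (illustrative example)",
        customProperties={"framework": "xgboost"},
    ),
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS address
emitter.emit(mcp)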
  • e

    elegant-salesmen-99143

    03/10/2023, 9:47 AM
Is it possible to import and display Kafka consumer groups and producer groups for Kafka as a data source? For example, as one of the properties: the group that an ingested topic belongs to. In the documentation I’ve only found mentions of consumer groups for the Kafka that DataHub itself requires to operate.
  • m

    modern-salesmen-91339

    03/10/2023, 12:10 PM
Hi team, I have just set up DataHub. Is there any way to push lineage from a Python pipeline that uses SQLAlchemy? (see the sketch below)
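A minimal sketch of pushing table-level lineage from Python with the DataHub REST emitter; it does not parse SQLAlchemy itself — you supply the upstream/downstream tables your pipeline reads and writes (the URNs and GMS address are placeholders):
Copy code
import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter

# Placeholder dataset URNs for the tables the pipeline reads and writes.
upstream_urns = [
    builder.make_dataset_urn(platform="mysql", name="shop.raw_orders", env="PROD"),
]
downstream_urn = builder.make_dataset_urn(platform="mysql", name="shop.orders_clean", env="PROD")

# Build a lineage MCE saying the downstream table is derived from the upstream ones.
lineage_mce = builder.make_lineage_mce(upstream_urns, downstream_urn)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS address
emitter.emit_mce(lineage_mce)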
  • b

    big-hair-55893

    03/10/2023, 4:04 PM
Hi Team. I built a data pipeline with dbt Core and Airflow (yes, jaffle_shop for a starter). Does anyone know what I need to do so that the data lineage (e.g. raw_customers -> stg_customers -> customers) shows up in DataHub (see the sketch below)? Or is this only possible with dbt Cloud, using dbt Cloud as an ingestion source?
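For dbt Core (no dbt Cloud required), the usual route is the dbt ingestion source pointed at the artifacts dbt writes into its target/ directory. A sketch under those assumptions (the paths, target platform and GMS address are placeholders):
Copy code
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "dbt",
            "config": {
                # Artifacts produced by `dbt run` / `dbt docs generate` in the target/ folder.
                "manifest_path": "./target/manifest.json",
                "catalog_path": "./target/catalog.json",
                # Platform the dbt models materialize into, e.g. postgres/snowflake/bigquery.
                "target_platform": "postgres",
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # placeholder GMS address
        },
    }
)
pipeline.run()
pipeline.raise_from_status()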
  • l

    lively-dusk-19162

    03/10/2023, 9:14 PM
Hi team, can anyone please help me with the following error? I was trying to build datahub-graphql-core using ./gradlew datahub-graphql-core:build. There is a line with mappingHelper.getResult(), called from datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/types/dataset/mappers/DatasetMapper.java, where MappingHelper is imported from datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/types/common/mappers/util/MappingHelper.java. getResult() is not available in MappingHelper.java, so I get a "cannot find symbol" error. When I build datahub-graphql-core, it gives me this error for all entities, i.e. DatasetMapper, DashboardMapper, ChartMapper, etc.
  • l

    late-dawn-4912

    03/11/2023, 6:39 PM
Hi All! We are using LookML and Looker ingestion (in that order) and we are having trouble getting our explores in. The views are loaded (including the lineage to the corresponding tables), but when we deploy a simple explore (based on an existing view in DataHub) it doesn't appear, and the following messages show up in the LookML recipe log:
[2023-03-11 181543,047] DEBUG {datahub.ingestion.source.looker.lookml_source:1635} - Attempting to load view file: ProjectInclude(project='__BASE', include='/tmp/tmpdhh07mpvlookml_tmp/3a66b9f7-6854-4d6a-95bf-63fa484cef30/checkout/datahub/explores/transacciones.explore.lkml')
[2023-03-11 181543,048] DEBUG {datahub.ingestion.source.looker.lookml_source:576} - Skipping file /tmp/tmpdhh07mpvlookml_tmp/3a66b9f7-6854-4d6a-95bf-63fa484cef30/checkout/datahub/explores/transacciones.explore.lkml because it doesn't appear to be a view file
Our LookML ingestion connects to a GitLab repo and the merges into our branch are all fine. Do you have any pointers?
  • m

    microscopic-room-90690

    03/13/2023, 6:08 AM
Hi team, I'm wondering whether data volume affects ingestion performance. Take S3 as an example: in our dev environment ingestion takes 1 hour, while in prod it takes more than 10 hours. The difference is that the data volume in prod is much larger than in dev. Will DataHub scan all S3 files?
  • h

    happy-camera-26449

    03/13/2023, 7:51 AM
Hi, I am trying to ingest some tables from ClickHouse into DataHub using clickhouse-usage. A few days ago the same tables were being ingested, but now only two tables show up as ingested. All tables are ingested through the clickhouse module, but not through clickhouse-usage. Can anyone help with this?
  • b

    big-hair-55893

    03/13/2023, 9:32 AM
Hi team. As far as I understand, there is no column-level data lineage available for metadata source ingestion like dbt/dbt Cloud, correct?
  • l

    lively-dusk-19162

    03/13/2023, 12:20 PM
Hi team, is it necessary to include pom.xml in the project?
  • f

    fierce-guitar-16421

    03/13/2023, 12:58 PM
    Hello Great Community! Is there a defined way to ingest groups from Google Cloud into DataHub as CorpGroup entities?
  • r

    rough-energy-46557

    03/13/2023, 1:12 PM
Hi All. I have a question about profiling data sources. Is there a way to change the profiling settings? For example, if I want to add a new check: how many rows contain negative values. I think it's possible with the Great Expectations framework, and the results of the new checks would show up on the Validation tab, but maybe there is another way?
  • c

    calm-dinner-63735

    03/13/2023, 3:39 PM
Is there any way to programmatically link two data sources: an MSK Kafka topic and a Glue table pointing to a schema registry?
  • h

    hallowed-lizard-92381

    03/13/2023, 11:04 PM
dbt: newly finding that ingestion is blocked with
    Copy code
    'message': 'java.lang.UnsupportedOperationException: Aspect: datasetKey does not currently support patch operations.', 'status': 500}
Probably related to the default write_semantics param, PATCH: "Whether the new tags, terms and owners to be added will override the existing ones added only by this source or not. Value for this config can be 'PATCH' or 'OVERRIDE'." Looks like it's trying to patch a dataset when only a DatasetSnapshot can be patched? (see the config sketch below)
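If the default PATCH semantics are indeed what triggers the unsupported patch operation, one workaround to try is switching the dbt source to OVERRIDE semantics. Only the relevant config keys are sketched here, with placeholder paths; note this trades away merging with tags/terms/owners added outside this source:
Copy code
# Only the relevant part of the dbt source config (placeholder paths, everything else unchanged):
dbt_source_config = {
    "manifest_path": "./target/manifest.json",
    "catalog_path": "./target/catalog.json",
    "target_platform": "postgres",
    # Default is PATCH; OVERRIDE avoids the aspect patch path at the cost of
    # overwriting tags/terms/owners previously added by this source.
    "write_semantics": "OVERRIDE",
}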
  • c

    chilly-engineer-73807

    03/14/2023, 9:28 AM
Hi guys. I need advice on creating lineage relations between Looker and Snowflake. I have created Looker, LookML, and Snowflake ingestion, of course. All of them have the same pipeline_name, but I don’t see the relations between Snowflake and Looker in Lineage. Do you maybe have a working example ingestion for this task?
  • t

    thousands-printer-59538

    03/14/2023, 1:33 PM
Hi everyone, I am currently trying to configure a Mongo data source with a slave member of a replica set through a cross-account VPC (AWS PrivateLink) in DataHub. Our analytics and production environments are hosted in two different AWS accounts. When I try ingestion on the MongoDB dataset, it fails because it tries to resolve the actual hostnames in the replica set, even though we configured the connection URI using the PrivateLink DNS. Any ideas on how to resolve this?
  • d

    damp-lighter-99739

    03/14/2023, 2:49 PM
Hi everyone, I'm new to DataHub and recently deployed it on EKS and ingested a few sources. I had a question about MCPs: while ingesting with the sink set to datahub-rest, where can I see all MCP requests (see the sketch below)? (Might be a repeat question here, sorry about that 😅)
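One way to inspect the emitted MCPs/MCEs is to re-run the same source with the file sink instead of datahub-rest, which writes every record to a local JSON file you can read; a sketch with a placeholder mysql source and credentials:
Copy code
from datahub.ingestion.run.pipeline import Pipeline

# Same source as the normal recipe, but with a file sink so every emitted
# work unit (MCP/MCE) is written to a local JSON file for inspection.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",  # placeholder: whichever source you normally ingest
            "config": {"host_port": "localhost:3306", "username": "u", "password": "p"},
        },
        "sink": {
            "type": "file",
            "config": {"filename": "./mcps.json"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()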
  • r

    rhythmic-church-10210

    03/14/2023, 5:00 PM
Hey all, I'm trying to ingest multiple databases from Athena, but once I specify a schema pattern, the databases show up as datasets in DataHub. They are not registered as databases/schemas under which there are tables. Can someone help? This is my recipe:
    Copy code
    source:
        type: athena
        config:
            aws_region: us-east-1
            work_group: primary
            schema_pattern:
                allow:
                    - '^auto_*'
            query_result_location: 's3://......'
  • h

    high-night-94979

    03/14/2023, 7:30 PM
Hi all! I am currently attempting to use the csv-enricher plugin to apply enrichment for tags and updated descriptions to different types of entities. This works great for datasets on the s3 and snowflake platforms. However, when using this approach for MLFeatures, updated descriptions do not seem to be picked up (note that tags are). Is this the expected behavior for MLFeature enrichment from a CSV file?
  • r

    rich-state-73859

    03/14/2023, 8:04 PM
Hi guys, when searching for a dataset that has siblings, which platform will be listed in the search result?
  • s

    shy-dog-84302

    03/15/2023, 5:45 AM
Hi! I am trying to build a GraphQL Kotlin client for DataHub and started generating code based on the GraphQL files from datahub-graphql-core. I get the following error while compiling the generated Kotlin code:
    Copy code
    e: file:///Users/***/build/generated/source/kotlin/org/***/data/datahub/graphql/generated/model/SchemaMetadata.kt:19:18 Type of 'version' doesn't match the type of the overridden var-property 'public abstract var version: Long? defined in org.entur.data.datahub.graphql.generated.model.Aspect'
I see that this is due to a mismatch in the overridden property: version: Long! in type SchemaMetadata vs. version: Long in interface Aspect. I am not an expert in GraphQL schema design, but this looks obvious to me, and aligning the types resolves the error. Do you think it is a genuine error that needs to be fixed? 👉 Please suggest the right channel to post this question if it doesn’t fit here 🙂
  • a

    adorable-computer-92026

    03/15/2023, 10:50 AM
Hi everyone, I'm trying to ingest data from MySQL but it fails even though I put the correct configuration in the MySQL recipe. What could be the problem? I get this error:
    Copy code
    ~~~~ Execution Summary - RUN_INGEST ~~~~
    Execution finished with errors.
    {'exec_id': '4245aa64-3224-4785-a2f9-b9808ea39a3a',
     'infos': ['2023-03-15 10:53:21.632486 INFO: Starting execution for task with name=RUN_INGEST',
               "2023-03-15 10:53:29.782012 INFO: Failed to execute 'datahub ingest'",
               '2023-03-15 10:53:29.782346 INFO: Caught exception EXECUTING task_id=4245aa64-3224-4785-a2f9-b9808ea39a3a, name=RUN_INGEST, '
               'stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
     'errors': []}
    
    ~~~~ Ingestion Report ~~~~
    {
      "cli": {
        "cli_version": "0.10.0",
        "cli_entry_location": "/tmp/datahub/ingest/venv-mysql-0.10.0/lib/python3.10/site-packages/datahub/__init__.py",
        "py_version": "3.10.9 (main, Jan 23 2023, 22:32:48) [GCC 10.2.1 20210110]",
        "py_exec_path": "/tmp/datahub/ingest/venv-mysql-0.10.0/bin/python3",
        "os_details": "Linux-5.10.16.3-microsoft-standard-WSL2-x86_64-with-glibc2.31",
        "mem_info": "90.65 MB"
      },
      "source": {
        "type": "mysql",
        "report": {
          "events_produced": 0,
          "events_produced_per_sec": 0,
          "entities": {},
          "aspects": {},
          "warnings": {},
          "failures": {},
          "soft_deleted_stale_entities": [],
          "tables_scanned": 0,
          "views_scanned": 0,
          "entities_profiled": 0,
          "filtered": [],
          "start_time": "2023-03-15 10:53:25.664072 (1.32 seconds ago)",
          "running_time": "1.32 seconds"
        }
      },
      "sink": {
        "type": "datahub-rest",
        "report": {
          "total_records_written": 0,
          "records_written_per_second": 0,
          "warnings": [],
          "failures": [],
          "start_time": "2023-03-15 10:53:25.315111 (1.67 seconds ago)",
          "current_time": "2023-03-15 10:53:26.984474 (now)",
          "total_duration_in_seconds": 1.67,
          "gms_version": "v0.10.0",
          "pending_requests": 0
        }
      }
    }
  • a

    agreeable-mechanic-5913

    03/15/2023, 1:05 PM
I’m trying to do BigQuery ingestion, but I keep getting errors. 😂
  • w

    white-vegetable-93125

    03/15/2023, 3:58 PM
Hello everyone, I'm trying to ingest my entire database through the DataHub browser UI. I want to be able to search individual rows of my database, not just the column stats. How do I upload my entire database and view its contents rather than just the column names and their associated statistics?