# ingestion
  • nice-autumn-10105

    01/12/2022, 3:13 PM
    Team, sub-domains are critical for metadata management. To create a sub-domain like Customer/Activity, is the option to use a table_pattern.allow regex (to control which entities fall into the sub-domain) plus a transformer (- type: "set_dataset_browse_path") to create the sub-domain in the UI? Would that be the correct approach to have datasets broken out by sub-domain?
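    For reference, a minimal sketch of such a recipe using DataHub's programmatic Pipeline API; the platform, credentials, regex, and the /Customer/Activity path are all placeholder assumptions:

```python
# A sketch of a recipe that scopes ingestion to one sub-domain and sets a
# browse path for it. Platform, credentials, regex, and path are assumptions.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "username": "datahub",
                "password": "datahub",
                # only ingest tables belonging to the Customer/Activity sub-domain
                "table_pattern": {"allow": ["customer_db\\.activity_.*"]},
            },
        },
        "transformers": [
            {
                "type": "set_dataset_browse_path",
                "config": {
                    # DATASET_PARTS is expanded per dataset by the transformer
                    "path_templates": ["/Customer/Activity/DATASET_PARTS"]
                },
            }
        ],
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()
```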
  • modern-monitor-81461

    01/12/2022, 3:30 PM
    Superset and Virtual Datasets: I have a question regarding the ingestion of Superset and virtual datasets. Superset supports two types of datasets:
    • Physical: refers directly to a physical table or a view in a DB somewhere
    • Virtual: refers to a query on a physical table or view. It doesn't exist in the DB; it is only a "thing" in Superset
    If I'm not mistaken, a virtual dataset could be the equivalent of a Looker Explore, for those familiar with Looker. When I ingest Superset dashboards and charts, I see charts that use physical tables, and those are nicely mapped to the proper source in DataHub's lineage. But when a chart uses a virtual dataset, it refers to a urn using the
    table_name
    provided by Superset, but that name is the name of the virtual dataset, which doesn't exist in DataHub. I think there should be a connection to the physical dataset like the following:
    Dashboard -> Chart -> Dataset (virtual) -> Dataset (physical)
    Right now, the virtual dataset refers to void, so I can't tell which physical dataset is being used for the query. And the query should be a property of the virtual dataset, in my opinion. Am I missing something in my deployment, or is there actually a gap in DataHub's Superset source?
  • shy-parrot-64120

    01/12/2022, 9:05 PM
    Hello all, I found an inconsistent thing in the
    kafka-connect
    ingestion lib: when processing a JDBC source connector configured for a Postgres source via the URL
    jdbc:postgresql://host:port/db
    its datasets are ingested with
    source_platform=postgresql
    rather than
    postgres
    (as the postgres ingestor uses). This causes an entity mismatch. Is there a way to handle this?
  • eager-gpu-17565

    01/13/2022, 9:04 AM
    @here I want to know if I can ingest metadata from Salesforce into DataHub. Please assist me. Thanks :)
  • quaint-branch-37931

    01/13/2022, 9:12 AM
    Hey all! Since 0.8.16, ingesting a dbt model creates two nodes - one for the dbt model and one for the underlying platform. I think this is definitely an improvement over the previous behavior, but it does seem to make lineage a bit more complex, because all chains are now double the length 🙂 Is there a way around this? Ideally I think we'd see, e.g., a merged dbt/bigquery node in the lineage graphs that would combine the information.
  • miniature-television-17996

    01/13/2022, 10:00 AM
    Hello! Could you please explain how to ingest procedures or functions from a database? I got only tables, and I need this information to create lineage. Thank you!
  • many-pilot-7340

    01/13/2022, 5:10 PM
    Is there a way to specify multiple configs in a single ingestion file? E.g., I have this ingestion file for MySQL, currently with localhost as my host_port. Can I specify another config with a different host_port, database, etc. in the same file?
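    Recipes take one source and one sink per file, but a single Python script can run several pipelines back to back, which achieves the same effect. A sketch, where both MySQL endpoints are made up:

```python
# Run one ingestion pipeline per MySQL endpoint from a single script.
# Host names, databases, and credentials below are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

mysql_endpoints = [
    {"host_port": "localhost:3306", "database": "orders"},
    {"host_port": "analytics-db.internal:3306", "database": "analytics"},
]

for endpoint in mysql_endpoints:
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {"username": "datahub", "password": "datahub", **endpoint},
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
```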
  • gentle-florist-49869

    01/13/2022, 5:22 PM
    Hi team, does anyone have examples or documentation about data profiling, please? A demo would be great.
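    For anyone searching later: profiling is enabled per recipe, under the source config. A minimal sketch (connection details are placeholders):

```python
# Enable dataset profiling in an ingestion recipe. The profiling section
# is the relevant part; connection details are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "mysql",
            "config": {
                "host_port": "localhost:3306",
                "username": "datahub",
                "password": "datahub",
                "profiling": {
                    "enabled": True,
                    # cap the rows scanned per table to keep runs fast
                    "limit": 100,
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.raise_from_status()
```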
  • best-planet-6756

    01/13/2022, 5:55 PM
    Hello Team, we are currently running into an issue when ingesting a MySQL DB with limited profiling. Here is the error we are seeing (db and table names have been changed for security reasons):
    'db.table': ["Profiling exception (pymysql.err.OperationalError) (1046, 'No database selected')\n"
                 '[SQL: CREATE TEMPORARY TABLE ge_temp_b7535ba0 AS SELECT * \n'
                 'FROM db.table \n'
                 ' LIMIT 100]\n'
                 '(Background on this error at: http://sqlalche.me/e/13/e3q8)']
    Please note that in our ingestion recipe we do not specify a db name, since there are multiple DBs at this source. When we ingest without profiling, it works perfectly and ingests all of the DBs at the source. We are looking for a fix that lets us ingest with limited profiling without having to specify the db, as we want to profile all of the DBs.
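    One possible workaround, assuming the profiler needs a selected database on the connection: run one pipeline per database so each run has a database in scope. A sketch with made-up database names:

```python
# Work around "(1046, 'No database selected')" by profiling each database
# through its own pipeline run. Database names here are placeholders.
from datahub.ingestion.run.pipeline import Pipeline

for db in ["db1", "db2", "db3"]:
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",
                    "password": "datahub",
                    "database": db,  # selects the DB, so the profiler has one in scope
                    "profiling": {"enabled": True, "limit": 100},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
```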
  • important-machine-62199

    01/14/2022, 5:33 AM
    Hi, thanks in advance for your help. We installed DataHub to explore it as a metadata aggregator accompanying IoT datasets. Currently my metadata lives in the same DB (Mongo) as the data, in simple JSON/JSON-LD format. How can I ingest this metadata, which follows my own information model (key-value sets), into DataHub's information model (key-value pairs)? Though DataHub is rich with features for metadata aggregation, browsing, search, and lineage, I am finding it difficult to move my data into DataHub.
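    One low-friction path for a custom key-value model like this is to map each JSON document's pairs into a dataset's customProperties via the Python REST emitter. A sketch; the sample document, platform, and dataset name are assumptions:

```python
# Map a custom JSON metadata document onto a DataHub dataset's
# customProperties. The sample document and URN parts are placeholders.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

doc = {"sensor_type": "temperature", "unit": "celsius", "site": "plant-7"}

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dataset_urn(platform="mongodb", name="iot.sensor_readings", env="PROD"),
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(
        description="IoT sensor readings (metadata imported from Mongo)",
        customProperties={k: str(v) for k, v in doc.items()},
    ),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)
```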
  • glamorous-carpet-98686

    01/14/2022, 5:50 AM
    Hello Team, we are wondering whether there is any way to add documentation for a dataset using a Python script (like the Python emitter)?
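    There is: documentation can be emitted as a dataset description with the Python REST emitter. A minimal sketch (the URN and wording are placeholders; note that upserting datasetProperties replaces the whole aspect):

```python
# Attach documentation (a description) to an existing dataset via the
# Python REST emitter. URN and wording are placeholders; beware that this
# upsert replaces the entire datasetProperties aspect.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dataset_urn(platform="hive", name="fct_orders", env="PROD"),
    aspectName="datasetProperties",
    aspect=DatasetPropertiesClass(description="Fact table of customer orders, refreshed daily."),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)
```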
  • mysterious-nail-70388

    01/14/2022, 6:32 AM
    Hi, I want to know how we can distinguish data sources that have the same database name but different addresses. With our current setup, a later ingestion run overwrites the metadata from an earlier one, but we need to distinguish same-named source database tables ingested from different IPs on one DataHub platform. I am considering changing the 'prod' environment name, but how should it be added and used?
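    One way to keep same-named databases from different addresses apart is to make the distinguishing value part of the dataset URN. Recent acryl-datahub versions expose a platform_instance source config for exactly this; a sketch, with invented hosts and instance names:

```python
# Distinguish two databases that share a name but live at different
# addresses by tagging each run with its own platform_instance
# (supported in recent acryl-datahub versions; hosts are placeholders).
from datahub.ingestion.run.pipeline import Pipeline

for host, instance in [("10.0.0.5:3306", "dc1"), ("10.0.1.9:3306", "dc2")]:
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": host,
                    "username": "datahub",
                    "password": "datahub",
                    "database": "prod_db",
                    # becomes part of the dataset URN, so runs don't overwrite each other
                    "platform_instance": instance,
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
```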
  • mysterious-nail-70388

    01/14/2022, 6:40 AM
    Hello Team, I want to add tags at the database level, but at present it seems only table tags are supported. Will there be support for this in the future?
  • boundless-student-48844

    01/14/2022, 7:24 AM
    Hi team, I would like to seek your suggestions on how to handle the use case below with DataHub 🙇 We use Hive internally for different data platforms, say an AWS Data Lake and an Azure Data Lake. We'd like to ingest Hive metadata for both data lakes, but we want to be able to differentiate the entities for each data platform even though both use Hive. And in the UI, users should be able to see tiles for the different data platforms on the homepage (not just one Hive tile, as in current DataHub). This translates closely to the data environment in DataHub terminology: it could have been achieved by setting the env config in the YAML recipes, which then becomes part of the dataset URN. However, I read that DataHub currently only supports 4 fabric types. What's the best way to enhance the code to achieve this?
  • red-pizza-28006

    01/14/2022, 8:19 AM
    The latest release has some remarkable improvements in the profiling performance for Snowflake. Something that was taking close to 50 mins in the past now runs in under 12 mins. Well done team 🎉
  • mysterious-nail-70388

    01/14/2022, 9:02 AM
    Hi, I am back. I ingested metadata from a Hive database; it used to populate data in 'Properties'. Now the run reports success, but the DataHub database has no Properties aspect, and I don't know why 😅
  • few-air-56117

    01/14/2022, 10:05 AM
    Hi guys, I have a question. Where is the metadata saved? In MySQL or in some files? I am curious: if I need to move DataHub to another k8s cluster, how can I also migrate the ingested metadata? Thx :D
  • clever-australia-61035

    01/17/2022, 7:56 AM
    Hello All, could anyone please help me figure out how to remove all the business glossary terms from DataHub in one shot?
  • lemon-hydrogen-83671

    01/17/2022, 4:21 PM
    Hey folks, for adding a new user to datahub do we still use the
    CorpUserInfo
    object that's in all the examples or do we start using
    CorpUserProperties
    as specified here: https://datahubproject.io/docs/graphql/objects/#corpuserproperties?
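    For context, on the ingestion side the aspect is still corpUserInfo; CorpUserProperties is the GraphQL-layer view of the same data. A sketch of creating a user with the Python emitter, with made-up profile details:

```python
# Create (or update) a DataHub user by emitting the corpUserInfo aspect.
# The username and profile fields are placeholders.
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import ChangeTypeClass, CorpUserInfoClass

mcp = MetadataChangeProposalWrapper(
    entityType="corpuser",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_user_urn("jdoe"),
    aspectName="corpUserInfo",
    aspect=CorpUserInfoClass(
        active=True,
        displayName="Jane Doe",
        email="jdoe@example.com",
    ),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)
```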
  • agreeable-thailand-43234

    01/17/2022, 8:18 PM
    Hi guys! What would be the "best practice" to get data from a Postgres DB hosted in AWS on a private VPC? I've created a
    db in lake formation
    , then a crawler to populate it, then pointed DataHub at the S3 bucket... is this the most cost-effective way to do it? Or would you rather have an endpoint on an EC2 instance?
  • mysterious-nail-70388

    01/18/2022, 3:35 AM
    Hi team, when will DataHub support Azkaban scheduling? I am using Azkaban as my scheduler; could you please surface Azkaban scheduling information in DataHub? 😄
  • mysterious-nail-70388

    01/18/2022, 5:51 AM
    Hi, besides using Docker to deploy DataHub, can it be deployed without Docker, including the frontend-react and GMS services? And where do I configure the related parameters and the backing MySQL, ES, and Kafka?
  • best-television-56567

    01/18/2022, 8:18 AM
    Hi all, I'm looking for a tool which can give an overview of the setup of a Kafka cluster:
    • which topics are there
    • what schemas are used for each topic
    • which producers and consumers exist for each topic
    • are there consumers which are lagging behind?
    I got the kafka metadata ingest working, but unfortunately this source lacks information about the producers and consumers. So I'm wondering if the kafka-connect ingest (https://datahubproject.io/docs/metadata-ingestion/source_docs/kafka-connect) will give me what I need; however, the documentation is a bit unclear to me. Can someone tell me:
    • what the kafka-connect source ingests, exactly?
    • why the kafka-connect example configuration has a DB connection configured?
    Thanks in advance!
  • colossal-easter-99672

    01/18/2022, 8:39 AM
    Hello, team. How do I use custom aspects in Python ingestion? I am looking at the entity classes, and they restrict which aspects can be attached (only the list of default aspects).
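    The typed entity classes are indeed restricted to built-in aspects, but a raw MetadataChangeProposal carrying a GenericAspect (a JSON payload) sidesteps that, provided the custom aspect is registered in your entity registry. A sketch; "myCustomAspect" and its payload are assumptions:

```python
# Emit a custom aspect by sending it as a GenericAspect (JSON payload)
# inside a raw MetadataChangeProposal. "myCustomAspect" must already be
# registered in your entity registry; the payload here is a placeholder.
import json

import datahub.emitter.mce_builder as builder
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    GenericAspectClass,
    MetadataChangeProposalClass,
)

payload = {"owner_team": "data-platform", "retention_days": 30}

mcp = MetadataChangeProposalClass(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dataset_urn(platform="mysql", name="db.table", env="PROD"),
    aspectName="myCustomAspect",
    aspect=GenericAspectClass(
        contentType="application/json",
        value=json.dumps(payload).encode("utf-8"),
    ),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit_mcp(mcp)
```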
  • handsome-belgium-11927

    01/18/2022, 10:56 AM
    Hello, team. Is it possible to ingest data freshness via MCP with the datasetUsageStatistics aspect? I can't see the corresponding attribute in the Python class, even though I updated acryl-datahub to version 0.8.23.0.
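    While waiting on an answer: freshness is often modeled with the operation timeseries aspect (which carries lastUpdatedTimestamp) rather than datasetUsageStatistics. A sketch, assuming a version where the operation aspect is available; the URN and timestamps are placeholders:

```python
# Report dataset freshness by emitting an "operation" timeseries aspect
# carrying lastUpdatedTimestamp. URN and timestamps are placeholders.
import time

import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ChangeTypeClass,
    OperationClass,
    OperationTypeClass,
)

now_ms = int(time.time() * 1000)

mcp = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType=ChangeTypeClass.UPSERT,
    entityUrn=builder.make_dataset_urn(platform="hive", name="db.table", env="PROD"),
    aspectName="operation",
    aspect=OperationClass(
        timestampMillis=now_ms,          # when this report is emitted
        operationType=OperationTypeClass.UPDATE,
        lastUpdatedTimestamp=now_ms,     # when the data itself last changed
    ),
)

DatahubRestEmitter(gms_server="http://localhost:8080").emit(mcp)
```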
  • breezy-controller-54597

    01/19/2022, 4:16 AM
    Hello, team. If the recipe contains a password, the page below says that we can define the password as a variable (e.g.
    ${MSSQL_PASSWORD}
    ). https://datahubproject.io/docs/metadata-ingestion#handling-sensitive-information-in-recipes Where can I define this variable? Following the docker-compose convention, I created a .env file in the same directory as the recipe yml file, but it didn't work.
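    The variable is read from the OS environment of the process that runs the ingestion, not from a .env file next to the recipe (that convention is docker-compose specific). A sketch that sets the variable and then invokes the CLI, with a placeholder value:

```python
# ${MSSQL_PASSWORD} in a recipe is expanded from the OS environment of the
# process running the ingestion, so export it in the shell (or set it here)
# before invoking `datahub ingest -c recipe.yml`. Values are placeholders.
import os
import subprocess

os.environ["MSSQL_PASSWORD"] = "s3cr3t"  # normally: export MSSQL_PASSWORD=... in the shell

# the child process inherits os.environ, so the recipe's ${MSSQL_PASSWORD} resolves
subprocess.run(["datahub", "ingest", "-c", "recipe.yml"], check=True)
```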
  • careful-engine-38533

    01/19/2022, 7:20 AM
    I am trying to ingest MongoDB, and I get the following error: "OperationFailure: not authorized on Integration_ContentDb to execute command { aggregate: "system.views", pipeline:" - any help?
  • acoustic-quill-54426

    01/19/2022, 3:36 PM
    Hi! I have integrated my company's custom ML platform with DataHub using the Python REST emitter, mapping our features to
    MLFeatures
    and
    MLFeatureTables
    . I had an issue with the
    MLFeatureProperties
    field
    dataType
    , which is optional but, once successfully ingested, was raising exceptions here. I just added a default
    UNKNOWN
    dataType, but I guess it should be either mandatory or nil-checked. WDYT? I can file an issue later 👍 I also have a question about the
    MLFeatureTableProperties
    aspect
    customProperties
    field, which is not rendered in the UI. In other entities, such as datasets, it is always rendered. Am I missing something? Many thanks 🙌
  • damp-queen-61493

    01/19/2022, 4:53 PM
    Hi! I'm trying to ingest MSSQL from Airflow. I get the following error:
    [2022-01-19, 16:43:56 UTC] {local_task_job.py:154} INFO - Task exited with return code Negsignal.SIGKILL
    I'm using the inline approach like this
  • red-pizza-28006

    01/19/2022, 6:10 PM
    I am wondering if there is any way to see whether the schema of a dataset has changed, and what the schema was at a particular point in time - basically, time travel for schemas in DataHub.