# getting-started

    narrow-bird-99605

    02/16/2022, 11:30 AM
    Hi everyone! I would like to ask about your experience with DataHub. We need to ingest metadata from the following applications: Redshift, Postgres, MySQL, S3, Athena, Glue, Spark, Tableau, dbt, and BigQuery. The number of objects is more than 100K, and I am trying to gauge the amount of work needed to have it in prod. Questions:
    1. After playing with the product for several days, I see that it is capable of ingesting all this data, but I feel it needs 2 FTEs for at least 3-4 months to set up every source step by step.
    2. How much day-to-day support do you do for DataHub?
    3. How often do you encounter bugs?
    4. How do you delete stale objects? Is there any instruction?
    5. How do you validate data quality?
    6. How do you set it up? We are in AWS, and I see there is a recipe to install it there, but I assume it needs to be integrated into the existing infra, and since the project has many components: how long did you spend getting it up and running in your cloud? (Of course it's subjective.)
    7. How do you maintain ownership? E.g., if there are objects in the same schema owned by different teams, do you use some naming strategy to automatically set the proper ownership, or do you use tags in some way?
    8. Any other complexity I am missing?
    Thanks in advance!
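
    On question 7: ownership can often be attached automatically at ingestion time with an ownership transformer, so a naming convention in the dataset URN can drive owner assignment. A minimal sketch of a programmatic pipeline using the pattern_add_dataset_ownership transformer from metadata-ingestion; the MySQL connection details, regex rules, and owner URNs are illustrative placeholders.
    Copy code
    # Sketch: assign owners by regex on the dataset URN during ingestion.
    # Connection details, regex rules, and owner URNs are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",
                    "password": "change_me",
                },
            },
            "transformers": [
                {
                    "type": "pattern_add_dataset_ownership",
                    "config": {
                        "owner_pattern": {
                            "rules": {
                                # dataset-URN regex -> owner URNs
                                ".*marketing.*": ["urn:li:corpuser:marketing_team"],
                                ".*finance.*": ["urn:li:corpuser:finance_team"],
                            }
                        }
                    },
                }
            ],
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()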

    gentle-optician-51037

    02/17/2022, 8:28 AM
    Hi guys, I followed the quickstart guide up to this step:
    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip uninstall datahub acryl-datahub || true # sanity check - ok if it fails
    python3 -m pip install --upgrade acryl-datahub
    python3 -m datahub version
    but when I run python3 -m datahub docker quickstart, it hits some small problems. Is there a problem with my virtual machine environment? I don't know how to deal with it.

    witty-painting-90923

    02/17/2022, 9:23 AM
    Hi all! I am trying to test an ingestion DAG, basically the one from the example: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/src/datahub_provider/example_dags/mysql_sample_dag.py
    But I don't have access to DataHub from Airflow for now, so I just need to make sure that the metadata gets extracted OK. I tried:
    • removing the sink (doesn't work, it's a required argument)
    • adding "dry_run = True" at the end of Pipeline.create (it gives me a connection error)
    So my questions are:
    • is it possible to "mock" an ingestion without an existing connection to DataHub?
    • does "dry run" need a connection to DataHub, or is this error caused by something else?
    Thank you in advance for your support
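
    One way to test extraction with no live DataHub connection is to swap the sink for a file (or console) sink, so nothing in the pipeline talks to GMS. A minimal sketch, assuming the stock file sink from metadata-ingestion; the MySQL credentials are placeholders.
    Copy code
    # Sketch: write extracted metadata events to a local file instead of
    # a DataHub server - no GMS connection required.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",   # placeholder credentials
                    "password": "change_me",
                },
            },
            # {"type": "console"} also works for a quick visual check
            "sink": {
                "type": "file",
                "config": {"filename": "./mysql_mces.json"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()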

    abundant-lizard-52842

    02/17/2022, 10:26 AM
    👋 Hi everyone! Good day! I have set up DataHub in EKS for a POC, but I am not able to change the default password. Could you please guide me on how to reset the default password?

    kind-diamond-98276

    02/17/2022, 1:17 PM
    Hi, I don't understand how to connect DataHub with the Keycloak auth module. Can you guide me, please?

    handsome-football-66174

    02/17/2022, 2:15 PM
    Hi Everyone - Quick Question - Are we storing all of the data for a given item in Elasticsearch, or just the searchable fields for that item? If the latter, do you then use the unique IDs (assuming those are stored alongside the searchable fields in ES) to query Postgres and return the data?

    breezy-camera-11182

    02/18/2022, 10:14 AM
    Hi all! Is the owner role removed in the new version? I saw in the DataHub demo that when adding an owner, there is no role specification. In the previous version I'm still able to choose the role. Is it only in the UI? Btw, I'm using v0.8.6.
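
    For what it's worth, the ownership type is still part of the metadata model even where the UI stops exposing it, so it can be set programmatically. A hedged sketch with the Python REST emitter; the dataset, user, and server address are placeholders.
    Copy code
    # Sketch: attach an owner with an explicit ownership type via REST.
    # Dataset name, user, and server address are placeholders.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        OwnerClass,
        OwnershipClass,
        OwnershipTypeClass,
    )

    dataset_urn = builder.make_dataset_urn("mysql", "db.my_table", "PROD")
    ownership = OwnershipClass(
        owners=[
            OwnerClass(
                owner=builder.make_user_urn("jdoe"),
                type=OwnershipTypeClass.DATAOWNER,  # the owner's "role"
            )
        ]
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=dataset_urn,
            aspectName="ownership",
            aspect=ownership,
        )
    )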

    busy-art-44452

    02/18/2022, 11:33 AM
    Hello everyone, I'm currently exploring DataHub. I have ingested some datasets via the data-lake source and added properties to them via an add_dataset_properties transformer. The properties show up in the UI, but would there be a way to search (or filter) datasets based on those properties?

    rapid-king-93225

    02/21/2022, 8:58 PM
    Hi, I just started exploring DataHub and wonder if you plan an integration with https://dagster.io/ and their nice software-defined assets.

    boundless-pharmacist-30987

    02/22/2022, 6:03 AM
    Hey, very basic question. I saw a demo from Saxo Bank, where they customized the "landing page" with their own logo. Is there an easy way to do this (like a config), or would I need to fork the repo?

    freezing-area-98807

    02/22/2022, 9:20 AM
    Hello everyone, I am ingesting Snowflake into DataHub with profiling: enabled: true set in the source config, but the ingestion always fails with the error snowflake.connector.errors.ProgrammingError: 090106 (22000): Cannot perform CREATE TEMPTABLE. This session does not have a current schema. Call 'USE SCHEMA', or use a qualified name. How should I configure the YAML file to solve this problem?
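
    The profiler creates temporary tables, so the session needs a usable default context. A hedged sketch of the Snowflake source config as a programmatic pipeline; the account, credentials, warehouse, and role below are placeholders, and the role must be allowed to create temp tables.
    Copy code
    # Sketch: Snowflake ingestion with profiling enabled. The profiler
    # needs a default warehouse/role whose session can create temp tables.
    # All connection values below are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "snowflake",
                "config": {
                    "host_port": "my_account",   # Snowflake account id
                    "username": "datahub_user",
                    "password": "change_me",
                    "warehouse": "COMPUTE_WH",   # default warehouse
                    "role": "DATAHUB_ROLE",      # needs temp-table rights
                    "profiling": {"enabled": True},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()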

    eager-gigabyte-36894

    02/22/2022, 2:13 PM
    Greetings, I'm very new to DataHub and went through the docs more or less end to end. I want to know if DataHub fits our problem description. We have a data discovery and reusability problem at work: we want to store datasets and retrieve them (mostly CSVs or SQLite files used for ML), and the DataHub UI has pretty much convinced our team that it would help us. But going through the docs, I feel that the only way to insert metadata into DataHub is via the CLI; also, it seems the CSVs would need a strict schema and metadata ingestion could happen via AWS Glue only, with no other way. My questions and doubts regarding the use cases:
    1. Can I ingest any CSV / .sqlite file (varying schema) without AWS Glue? I did look at this: https://datahubproject.io/docs/metadata-ingestion/source_docs/s3
    2. Can I get a list of all metadata ingestions that happened previously via an API/library call?
    3. Can I search on keywords, apart from the URL way, via a lib/SDK/REST API? I did go through this: https://datahubproject.io/docs/how/search
    I'm a Python dev, so anything Python-centric would be nice. Thanks in advance! 🙏
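
    On question 3: search is also exposed over the GMS REST API, not only the UI. A sketch in Python, assuming the documented /entities?action=search Rest.li endpoint on a default local deployment; the query string is a placeholder.
    Copy code
    # Sketch: full-text search against GMS from Python.
    # Server URL and query are placeholders for a local deployment.
    import requests

    resp = requests.post(
        "http://localhost:8080/entities?action=search",
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
        json={
            "input": "my_table",  # keyword(s) to search for
            "entity": "dataset",  # entity type to search over
            "start": 0,
            "count": 10,
        },
    )
    resp.raise_for_status()
    # Response shape may vary by version; entities carry the matching URNs.
    for match in resp.json()["value"]["entities"]:
        print(match)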

    witty-painting-90923

    02/22/2022, 4:17 PM
    Hello! Can “platform_instance” be filtered on? In my case platform_instance is “ESdev”, and there seem to be no filters 😞 Thank you in advance! gms v0.8.26, datahub cli v0.8.26.3

    bland-orange-13353

    02/22/2022, 5:40 PM
    This message was deleted.

    bored-dress-52175

    02/23/2022, 1:39 PM
    What do -u and -a mean in these arguments?

    able-rain-74449

    02/23/2022, 3:35 PM
    Hi all, quick question: what is the bare minimum needed to run DataHub? Do you have manifests for it all somewhere? Helm adds a bit of complexity.

    able-rain-74449

    02/23/2022, 4:36 PM
    Is there a stable/datahub chart available?

    billions-table-9927

    02/23/2022, 6:58 PM
    Hello! We are exploring the DataHub features and trying to load lineage and queries from Snowflake, and we found out that there is no access to snowflake.account_usage.access_history because we don't have the Enterprise Edition. Is there any other way to load lineage and queries into DataHub?
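
    Where the warehouse can't provide lineage, one fallback is to emit lineage yourself from whatever knows it (ETL code, a scheduler, a parsed query log). A minimal sketch using the Python emitter's lineage helper; dataset names and the server address are placeholders.
    Copy code
    # Sketch: emit table-level lineage directly, bypassing access_history.
    # Dataset names and server address are placeholders.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    lineage_mce = builder.make_lineage_mce(
        [
            builder.make_dataset_urn("snowflake", "db.schema.upstream_a"),
            builder.make_dataset_urn("snowflake", "db.schema.upstream_b"),
        ],
        builder.make_dataset_urn("snowflake", "db.schema.downstream"),
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit_mce(lineage_mce)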

    strong-battery-10900

    02/24/2022, 1:06 AM
    Hi all, when I build the project using ./gradlew build, I get the error below:
    Copy code
    > Task :metadata-service:restli-impl:validateModels
    SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
    SLF4J: Defaulting to no-operation (NOP) logger implementation
    SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
    
    > Task :docs-website:generateGraphQLDocumentation
    yarn run v1.22.0
    warning ../../../package.json: No license field
    $ docusaurus docs:generate:graphql
    /Users/yaoyichen/IdeaProjects/datahub/docs-website/node_modules/@docusaurus/core/bin/docusaurus.js:49
    if (notifier.update && semver.gt(this.update.latest, this.update.current)) {
                                                 ^
    
    TypeError: Cannot read property 'latest' of undefined
        at Object.<anonymous> (/Users/yaoyichen/IdeaProjects/datahub/docs-website/node_modules/@docusaurus/core/bin/docusaurus.js:49:46)
        at Module._compile (internal/modules/cjs/loader.js:1063:30)
        at Object.Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
        at Module.load (internal/modules/cjs/loader.js:928:32)
        at Function.Module._load (internal/modules/cjs/loader.js:769:14)
        at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:72:12)
        at internal/main/run_main_module.js:17:47
    error Command failed with exit code 1.
    info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
    
    > Task :docs-website:generateGraphQLDocumentation FAILED
    
    > Task :datahub-web-react:yarnGenerate
    [08:55:51] Generate [completed]
    [08:55:51] Generate to src/ (using EXPERIMENTAL preset "near-operation-file") [completed]
    [08:55:51] Generate outputs [completed]
    Done in 7.63s.
    
    > Task :metadata-ingestion:environmentSetup
    Requirement already satisfied: pip in ./venv/lib/python3.8/site-packages (22.0.3)
    Requirement already satisfied: wheel in ./venv/lib/python3.8/site-packages (0.37.1)
    Requirement already satisfied: setuptools==57.5.0 in ./venv/lib/python3.8/site-packages (57.5.0)
    
    FAILURE: Build failed with an exception.
    
    * What went wrong:
    Execution failed for task ':docs-website:generateGraphQLDocumentation'.
    > Process 'command '/Users/yaoyichen/IdeaProjects/datahub/docs-website/.gradle/yarn/yarn-v1.22.0/bin/yarn'' finished with non-zero exit value 1
    
    * Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.
    
    * Get more help at https://help.gradle.org
    
    Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
    Use '--warning-mode all' to show the individual deprecation warnings.
    See https://docs.gradle.org/5.6.4/userguide/command_line_interface.html#sec:command_line_warnings
    
    BUILD FAILED in 38s
    165 actionable tasks: 21 executed, 144 up-to-date

    full-cartoon-72793

    02/24/2022, 7:56 PM
    Hello! Any guidance or docs on how to deploy DataHub into Azure? I saw the pages for AWS and GCP, but there is little on how to deploy into Azure in the doc pages I have seen so far

    able-rain-74449

    02/25/2022, 11:17 AM
    Hi all, QQ: in terms of using DataHub as pull-based, would you only use MySQL? No Kafka or Elasticsearch?

    able-rain-74449

    02/25/2022, 11:22 AM
    What is needed for pull-based vs. push-based ingestion? Is everything needed in both cases?
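
    Roughly: pull-based means the metadata-ingestion framework connects out to a source on a schedule, while push-based means your own systems emit metadata events to DataHub; the backend components (MySQL, Elasticsearch, Kafka) are part of the DataHub deployment either way. A sketch of the push path using the Python REST emitter; the dataset, description, and server address are placeholders.
    Copy code
    # Sketch of push-based ingestion: a producer system emits metadata
    # directly to DataHub. Dataset, description, and server are placeholders.
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetPropertiesClass,
    )

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=builder.make_dataset_urn("postgres", "db.public.orders"),
            aspectName="datasetProperties",
            aspect=DatasetPropertiesClass(description="Pushed from our ETL job"),
        )
    )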

    bland-orange-13353

    02/25/2022, 11:42 AM
    This message was deleted.

    witty-painting-90923

    02/25/2022, 2:34 PM
    Hello! Another general question from me 🙂 Imagine you are using DataHub and a table is removed in one of your databases. It gets synced with DataHub through stateful ingestion, right? And the metadata gets updated. But the docs page says that stateful ingestion is only supported for SQL-based sources. So the question: how do people running DataHub in production remove stale metadata for e.g. Elasticsearch and other non-SQL sources? Or is it really not possible? https://datahubproject.io/docs/metadata-ingestion/source_docs/stateful_ingestion/#datahub-ingestion-state-provider
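
    For the SQL sources that do support it, stale-entity removal is switched on in the source config (non-supported sources are typically cleaned up by hand, e.g. with the datahub delete CLI). A sketch of the stateful setup, assuming the documented stateful_ingestion options; connection details are placeholders.
    Copy code
    # Sketch: stateful ingestion. Entities that disappear from the source
    # between runs are removed as stale. Requires a stable pipeline_name.
    # Connection details are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "pipeline_name": "prod_mysql_metadata",  # identity for stored state
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",
                    "username": "datahub",
                    "password": "change_me",
                    "stateful_ingestion": {
                        "enabled": True,
                        "remove_stale_metadata": True,
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()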

    wooden-gpu-74321

    02/26/2022, 8:34 AM
    Hi everyone, I installed DataHub with the quickstart guide. I want to use Google authentication and have configured everything in GCP. But how can I configure datahub-frontend with OIDC?

    able-rain-74449

    02/28/2022, 8:31 AM
    Hi all, is there a DataHub architecture diagram where I can see the connections between components, like MySQL, Elasticsearch, Kafka, Neo4j, GMS, etc.?

    hallowed-toddler-12609

    02/28/2022, 9:33 AM
    Hi all. I have some questions about DataHub; I would be glad to hear your answers :)
    • Does it have a Vertica connector? I didn't see it in the platform list. Is it possible to add one yourself?
    • Does DataHub have a workflow to request and confirm changes in the data catalog?
    • Does it have automatic lineage, and how can I build it?
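
    On the first bullet: even without a dedicated connector, databases with a SQLAlchemy dialect can often be ingested through the generic sqlalchemy source. A hedged sketch, assuming a Vertica SQLAlchemy dialect is installed; the connection URI is a placeholder.
    Copy code
    # Sketch: generic SQLAlchemy ingestion for a database without a
    # dedicated connector. Assumes a Vertica dialect is installed, e.g.:
    #   pip install 'acryl-datahub[sqlalchemy]' sqlalchemy-vertica-python
    # The connection URI is a placeholder.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "sqlalchemy",
                "config": {
                    "platform": "vertica",
                    "connect_uri": "vertica+vertica_python://user:pass@host:5433/db",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()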

    gentle-optician-51037

    03/01/2022, 1:59 AM
    Hi, I encountered a bit of trouble when using DataHub. Why do I have to pull a new image from the registry every time I use the command, instead of reusing the image I pulled down the first time? I now have too many images with tag <none> locally. My startup command is python3 -m datahub docker quickstart --quickstart-compose-file ./docker/quickstart/docker-compose-without-neo4j.quickstart.yml. I checked the compose file, and the image entries did not contain the "-". Is this a Docker problem?

    bored-dress-52175

    03/01/2022, 2:33 AM
    I have to use some environment variables in the config, but my doubt is how I reference them. I have made an .env file and mentioned all the environment variables in that file. Is this the right way? If it is, how will datahub look for the .env file? Do I have to pass extra parameters to the datahub ingest command?
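
    For recipes run with datahub ingest, ${VAR}-style environment-variable expansion is supported, but the variables have to be present in the process environment (e.g. exported, or sourced from the .env file first); to my knowledge the CLI does not read a .env file on its own. In the programmatic route the same thing is plain Python; the variable names below are made up.
    Copy code
    # Sketch: reference secrets from the process environment instead of
    # hard-coding them. Variable names are examples; export them (or
    # `source .env`) before running.
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": os.environ["MYSQL_HOST_PORT"],
                    "username": os.environ["MYSQL_USER"],
                    "password": os.environ["MYSQL_PASSWORD"],
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {
                    "server": os.environ.get("DATAHUB_GMS", "http://localhost:8080")
                },
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()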

    agreeable-river-32119

    03/01/2022, 5:27 AM
    Hi folks! I would like to ask why we don't have table-level lineage based on Hive, just like ClickHouse has.