# getting-started
  • silly-translator-73123 (11/15/2021, 7:02 AM)
    I have added two glossary terms and want to delete the demo glossary categories. After deleting the demo items from business_glossary.yml and re-ingesting, the demo glossaries are still there.
  • orange-flag-48535 (11/15/2021, 11:24 AM)
    How do I stop all the DataHub Docker containers? (I started them with `datahub docker quickstart`.)
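    A hedged sketch of two options: newer CLI versions ship a stop flag, and plain Docker works regardless (the name filter assumes the quickstart's default container names):
    ```bash
    # Option 1 (assumption: your datahub CLI version supports --stop)
    datahub docker quickstart --stop

    # Option 2: stop every container whose name contains "datahub" (plain Docker)
    docker ps -q --filter "name=datahub" | xargs docker stop
    ```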
  • agreeable-hamburger-38305 (11/16/2021, 2:14 AM)
    Hi all! I am currently ingesting users one by one as they log in through OIDC. Does anyone know how they can edit their "Ask me about" and "Teams" fields on their profile page? Thanks!
  • creamy-library-6587 (11/16/2021, 7:02 AM)
    Hi all, do we have a Swagger document so that we can find the APIs easily?
  • hundreds-finland-95017 (11/16/2021, 6:02 PM)
    Hi all, I cloned the code and tried to run it locally. When I start the services (by calling docker/quickstart.sh), it fails with the error: datahub-gms exited with code 255.
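    A first diagnostic step (plain Docker, nothing DataHub-specific; the container name comes from the error message above):
    ```bash
    # Inspect the last log lines of the failing GMS container
    docker logs --tail 100 datahub-gms
    ```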
  • mammoth-pencil-22596 (11/16/2021, 6:35 PM)
    I'm new to DataHub and exploring it, as we are looking to use it in our organization. I have a question about general notifications: what are the notification mechanisms, and can they be customised for a group of users? Can we send a notification to Slack when a particular condition is met? Is this feature available in DataHub?
  • aloof-airline-3441 (11/17/2021, 12:53 PM)
    Hi all, there was some chatter about this back in May -- is this still happening? https://datahubproject.io/docs/rfc/active/1841-lineage/field_level_lineage/
  • nutritious-bird-77396 (11/17/2021, 4:30 PM)
    I was looking for documentation on switching the DB store from MySQL to Postgres. The link in this issue (https://github.com/linkedin/datahub/issues/1734) is broken. Could someone point me to the updated documentation link? I am having a hard time finding it...
  • swift-lion-29806 (11/17/2021, 6:53 PM)
    Hello, I'm exploring DataHub as an option for a data catalog for our ML team (@ Elastic).
  • swift-lion-29806 (11/17/2021, 6:55 PM)
    I'm trying to generate a file to be used in file-based metadata ingestion. For this I used the following YAML:
    ```yaml
    source:
      type: file
      config:
        filename: ./data.json
    sink:
      type: file
      config:
        filename: ./output.json
    ```
  • agreeable-hamburger-38305 (11/17/2021, 11:36 PM)
    Quick question: if a column description in BigQuery is “A”, it’s ingested into DataHub and the user edits it to “B” in the UI, and in BigQuery it’s changed to “C” and re-ingested, what would be shown in DataHub?
  • creamy-library-6587 (11/19/2021, 6:43 AM)
    Hi, I tried to run the command "./gradlew build", but it always returns "java.nio.file.InvalidPathException: Illegal char <:> at index 188:".
  • swift-lion-29806 (11/19/2021, 1:55 PM)
    Is there a way to clear all metadata entries from DataHub? `datahub docker nuke` removes the DataHub instance itself. Wondering if there is any command to just clean the metadata entries?
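    For reference, later CLI versions added a `datahub delete` command that removes metadata without touching the deployment itself; a hedged sketch (flags vary by version, and the URN below is a made-up example):
    ```bash
    # Soft-delete a single entity by URN (soft delete is the default)
    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.accounts,PROD)"

    # Hard-delete all metadata ingested from one platform (use with care)
    datahub delete --platform postgres --hard
    ```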
  • nutritious-bird-77396 (11/21/2021, 3:02 PM)
    Hey team... For a PoC project I spun up MSK, OpenSearch, RDS (Postgres), and Schema Registry, and ran the docker processes `datahub-frontend` and `datahub-gms`. After that we ingested metadata from a local Redshift cluster using a DataHub ingestion recipe (source: redshift, sink: `datahub-kafka`). I was under the impression I might have to run the MCE process to pick the messages up from Kafka and push them to GMS, but that wasn't necessary; I could already see the messages in the frontend and the datastore (Postgres). Could someone explain this?
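    One likely explanation (hedged; the variable names below come from the GMS container configuration and may differ across versions) is that GMS can run the MCE/MAE consumers embedded, so no standalone consumer job is needed:
    ```yaml
    # docker-compose sketch: embedded consumers toggled via environment variables
    datahub-gms:
      environment:
        - MCE_CONSUMER_ENABLED=true
        - MAE_CONSUMER_ENABLED=true
    ```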
  • creamy-library-6587 (11/22/2021, 4:48 PM)
    Hi team, do we have an API to list all tags and glossary terms?
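    A hedged sketch using the GraphQL search endpoint (the GMS host/port and the exact query shape are assumptions; swap TAG for GLOSSARY_TERM to list terms):
    ```bash
    curl -s 'http://localhost:8080/api/graphql' \
      -H 'Content-Type: application/json' \
      -d '{"query": "{ search(input: {type: TAG, query: \"*\", start: 0, count: 100}) { total searchResults { entity { urn } } } }"}'
    ```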
  • melodic-oil-59434 (11/23/2021, 1:06 PM)
    Hello everyone, some basic questions:
    1. I was wondering if there is a way to ingest database schemas?
    2. Is it possible to integrate Great Expectations or another method for data validation?
    3. What does the Queries tab refer to?
    4. Is the tagging and business glossary term linking generally done manually, or are there use cases where this would be automated in some way?
    Sorry these are so basic; any response would be much appreciated!
  • dazzling-appointment-34954 (11/23/2021, 5:30 PM)
    Hello everyone, first of all thanks for being such a great community! I have some basic questions about metadata ingestion that I could not figure out from a search here in Slack. From what I understand we have two principles:
    • Pushing metadata (e.g. through Kafka streams) in an automated manner from sources directly into DataHub (through the metadata service).
    • Pulling metadata through recipes (e.g. for Postgres databases) in a regularly running script.
    First of all, please correct me if I got this wrong 🙈 And here are the questions:
    • Is there a general overview of the concepts and which concept is suitable for which use case? (I checked the very good docs section, but it is rather technical, I think.)
    • For what kind of sources can I enable metadata pushes, and what kind of additional customization/scripts do I need to set up? Is this only available via streams, or is it possible to push metadata in other ways too?
    • Is there some mechanism for delta-pulling metadata after the initial sync, or is it always a full sync for sources like Postgres databases? What happens to descriptions I entered through the UI when I pull additional metadata or restart a sync?
    • Can I ingest a custom metadata entity in any way (for example, a potential customer wants to add KPIs with documentation and explanations to the catalog)?
    Thank you very much in advance for your answers and support!
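    For the pull side, a minimal recipe for a Postgres source might look like this (a sketch; host, credentials, and database name are placeholders):
    ```yaml
    source:
      type: postgres
      config:
        host_port: "localhost:5432"
        database: mydb
        username: datahub
        password: example
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"
    ```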
  • boundless-scientist-520 (11/24/2021, 8:25 PM)
    Hello! I ran a recipe to ingest Postgres metadata. Then I deleted a table from the database and ran the recipe again. The table still appears in DataHub. Shouldn't the table have been dropped from DataHub?
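    Ingestion is additive by default, so dropped tables are not removed on re-ingestion. Later versions added stateful ingestion, which can soft-delete entities that disappear between runs; a hedged sketch (key names may vary by version, and stateful ingestion also requires a pipeline_name):
    ```yaml
    pipeline_name: postgres_prod
    source:
      type: postgres
      config:
        host_port: "localhost:5432"
        database: mydb
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"
    ```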
  • brief-cricket-98290 (11/25/2021, 11:28 AM)
    Hello everyone, I'm in the process of building out a PoC for DataHub, and I'm struggling with Kafka ingestion. While ingesting Kafka topics is straightforward, I am not able to ingest the schemas associated with each topic, as we do not have a schema registry deployed. Is it possible to manually add a schema to a specific topic in DataHub, just for the sake of testing? Edit: Does anyone have a similar problem (of not having a schema registry, or having one but with Protobuf-defined schemas), and did you manage to find some workaround for (Protobuf) schema ingestion? Thank you all!
  • agreeable-river-32119 (11/29/2021, 1:09 AM)
    Hi team, I want to develop it locally. Can I run it without Docker?
  • brave-forest-5974 (11/29/2021, 8:53 AM)
    I ran into this Elasticsearch issue after ingesting a portion of our data into a PoC. By a back-of-the-napkin calculation, we have around 200k assets right now. Do you have any tips for other settings that should be tweaked at this scale?
    ```json
    {
      "error": {
        "root_cause": [],
        "type": "search_phase_execution_exception",
        "reason": "",
        "phase": "fetch",
        "grouped": true,
        "failed_shards": [],
        "caused_by": {
          "type": "too_many_buckets_exception",
          "reason": "Trying to create too many buckets. Must be less than or equal to: [65535] but was [65536]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
          "max_buckets": 65535
        }
      },
      "status": 503
    }
    ```
    I found this related thread in troubleshooting.
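    As a stopgap, the limit can be raised via the cluster settings API (hedged: the setting name comes straight from the error above; the endpoint and port assume the quickstart Elasticsearch):
    ```bash
    curl -XPUT 'http://localhost:9200/_cluster/settings' \
      -H 'Content-Type: application/json' \
      -d '{"persistent": {"search.max_buckets": 100000}}'
    ```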
  • chilly-analyst-45780 (11/29/2021, 2:02 PM)
    Hello. I'm a bit in doubt whether or not DataHub is the correct choice for what I'm looking for. Basically we want a data hub that acts as the master for our customer data. This means only one customer record exists, with a lot of metadata around that customer, and an API which can then serve this data. Is this what DataHub does? Or is it more of a way to see where data is, i.e. there would be multiple entries for the same customer, one per data source?
  • aloof-london-98698 (11/29/2021, 2:51 PM)
    Hey team, is there a managed-service (or SaaS) version of DataHub available?
  • rough-garage-43684 (11/29/2021, 7:06 PM)
    Hi~ I'm evaluating/planning to use Great Expectations as a data quality solution for my data warehouse (which already uses the wonderful DataHub). AFAIK, integrating with Great Expectations is part of DataHub's roadmap, and this week is the beginning of the development of this feature. Is the feature list detailed somewhere? As part of evaluating Great Expectations, I want to see how it will be integrated with DataHub. Many thanks!
  • rich-greece-17287 (11/30/2021, 4:46 AM)
    hello everyone
  • enough-london-69397 (11/30/2021, 4:46 PM)
    Hi guys, I'm evaluating the tool and I have a question about some capabilities. I ingested some metadata from AWS Athena, and there is no explicit information about PKs there, but we follow an explicit naming pattern for PKs, `id_ + 'table_name'`, so the `account` table has the PK `id_account`. If I ingest this metadata through DataHub, can it smartly tell that `id_account` is a PK? The image below does not show anything like that:
  • acceptable-honey-21072 (12/04/2021, 7:43 AM)
    I replicated the 'docker' folder and ran the command `COMPOSE_DOCKER_CLI_BUILD=1 DOCKER_BUILDKIT=1 docker-compose -p datahub build`. Is this the way to start DataHub with Docker Compose?
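    For comparison, the command above only builds the images; the usual entry points for actually starting DataHub are the repo's quickstart script or the published CLI (a sketch; the pip package name is acryl-datahub):
    ```bash
    # From a source checkout
    ./docker/quickstart.sh

    # Or with the CLI installed
    python3 -m pip install acryl-datahub
    datahub docker quickstart
    ```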
  • big-coat-53708 (12/05/2021, 12:18 PM)
    Hi @mammoth-bear-12532, thanks for the wonderful article! The downside analysis of the first-generation architecture really makes things clear. Our team has been using Amundsen for a while, and now we are evaluating the potential of migrating to DataHub. The main reason is the push model, which is really not supported in Amundsen. The other reason is that we are facing performance issues with Amundsen: we have about 30k tables in our Trino/Hive metastore environment, which ends up running 3 million queries on neo4j. It takes about 90 min for every sync, and I'm still exploring whether DataHub could do better than this. There's one thing that I would like to clarify about the push model. I understand that it provides an interface for triggering the ingestion, but for every ingestion it still does a full pull, right? It would still extract every single piece of metadata from the source, am I correct? In my understanding, whenever I get an event from the metastore, I could trigger an ingestion through Kafka or the REST API. The ingestion basically pulls everything from the source and dumps it into the sink according to the recipe. Please correct me if I'm wrong. Thanks 🙏
  • dazzling-appointment-34954 (12/08/2021, 11:13 AM)
    Hi DataHub experts, we are currently preparing for a PoC with DataHub at a big client here in Germany, and I have a few questions:
    • Business glossary: is it possible to define ownership for glossary terms and to include a little workflow for new elements? There is currently no UI to define glossary terms, if I saw it correctly? Maybe using GitHub for the workflow and the auth, and using the files for ingestion into DataHub, is a good way? Do you have any best practices? (See the sketch of the glossary file format below.)
    • Metadata model extension: can we include a new object like "KPIs" in the catalog and also create lineage between these new objects? I saw the file-based approach to extending the metadata model in the last community session; is this already available?
    • What is the best way to allow commenting or other collaboration on elements in the catalog?
    • Adding new data sources: we would like to create a new connector for SAP HANA (and probably open-source it later). There is a standard JDBC driver available. Can anyone give me a rough estimate of how long it might take to create a running version of this new connector?
    We really love the product and the community, so thank you for your support in advance. This might be a great opportunity to drive adoption of DataHub within Germany 🙂
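    On the glossary point, ownership can be declared directly in the business_glossary.yml that gets ingested; a hedged sketch of the file format (the node, term, and owner values are made-up examples):
    ```yaml
    version: 1
    source: DataHub
    owners:
      users:
        - datahub
    nodes:
      - name: Classification
        description: Data classification scheme
        terms:
          - name: Confidential
            description: Restricted to internal use only
    ```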
  • brash-carpenter-51184 (12/08/2021, 4:19 PM)
    Hello everyone! I'm currently playing around with DataHub and would like to know if there is any way to figure out from the UI whether there was a schema change, e.g. a new column was added or the type of a column was changed?