# getting-started
  • cool-architect-34612

    04/10/2022, 11:11 PM
    Hi, is there a way to display a data sample on the frontend?
  • cool-architect-34612

    04/11/2022, 3:40 AM
    Hi, I would like to know how I can change the metadata storage to AWS RDS.
  • cold-jelly-78088

    04/11/2022, 10:39 AM
    Hi! Quick question on static authentication. We are running on the quickstart setup, and have added a users.props file in .datahub/plugins/frontend/auth. This works well for adding new users. However, we've also tried changing the datahub default user password here, but we are still able to login using datahub:datahub (but also using the new password) after restarting the frontend. Any ideas on where we have gone wrong?
  • millions-notebook-72121

    04/11/2022, 11:05 AM
    Hi - not sure this is the right channel for this, but unsure where to post it. We're using Datahub currently as the data catalog in my company. We now are revisiting how we perform access controls to our data, and we're thinking of using a data catalog as the source of truth to figure out accesses. We want to use Tag-based access controls (TBAC), in particular in Lakeformation. In Datahub, we can tag assets (tables and columns), which is great. However, to implement TBAC, we also need the other side of the equation, that is which tags belong to which group/user. Is this something we can surface on Datahub at all? I can't seem to be able to add tags to groups, is this in fact a supported feature?
  • alert-football-80212

    04/11/2022, 12:16 PM
    Hi everyone! I am new to DataHub, and I would like to know if there is a beginner's guide/video to getting to know the basics and concepts of DataHub
  • nutritious-jackal-99119

    04/11/2022, 1:15 PM
    ⠿ datahub-frontend-react Error 4.6s no matching manifest for linux/amd64 in the manifest list entries. Any idea on this, guys? I am trying to install DataHub on my local MacBook.
  • modern-monitor-81461

    04/11/2022, 7:34 PM
    Why does Looker show up as a platform and not Superset (see the demo instance)? Is it because Looker offers Explore-like views, as opposed to Superset being strictly a BI tool (but it does offer virtual datasets, a bit like Explores, no)?
  • square-solstice-69079

    04/12/2022, 8:18 AM
    Is it possible to remove the database name (In Redshift) as the path when browsing? And/or also remove it from the urn?
  • hallowed-analyst-96384

    04/12/2022, 6:59 PM
    Hi everyone, can someone please help me out? Here is my recipe:
    source:
      type: datahub-business-glossary
      config:
        file: business_glossary.yml
    sink:
      type: datahub-rest
      config:
        server: "${DATAHUB_REST_HOST}:${DATAHUB_REST_PORT}"
    But after exporting DATAHUB_REST_HOST=localhost and DATAHUB_REST_PORT=8080 and running the ingestion, I get this error:
    UnboundVariable: 'DATAHUB_REST_HOST: unbound variable'
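    One way to sidestep shell-variable expansion issues like this is to build the recipe programmatically with the DataHub Python SDK; a minimal sketch, assuming the acryl-datahub package and a local GMS (the default host/port values below are placeholders):
    Copy code
    import os

    from datahub.ingestion.run.pipeline import Pipeline

    # Read the connection details in Python instead of relying on shell
    # expansion inside the YAML recipe. Defaults are placeholders.
    host = os.environ.get("DATAHUB_REST_HOST", "http://localhost")
    port = os.environ.get("DATAHUB_REST_PORT", "8080")

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {"file": "business_glossary.yml"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": f"{host}:{port}"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()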
  • microscopic-mechanic-13766

    04/13/2022, 8:28 AM
    Hi, I am deploying DataHub in Docker and I have taken the quickstart docker-compose as a reference to build mine. While reading it I found the service datahub-actions. I have been reading the documentation and didn't find any reference to this service; what is its purpose? And where could I find an explanation of some of the tags used in the mentioned docker-compose?
  • brave-forest-5974

    04/13/2022, 12:26 PM
    Is it possible in GraphQL to show the lineage between two nodes? Something other than pulling the up/downstreams of both and comparing the overlap.
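    A rough way to check this from the API is to query one dataset's lineage over GraphQL and look for the other URN in the result; a hedged sketch, assuming the /api/graphql endpoint and the lineage field shape (names may differ across DataHub versions), and it only covers direct one-hop relationships:
    Copy code
    import requests

    # GMS GraphQL endpoint; if metadata-service auth is enabled, add an
    # "Authorization: Bearer <token>" header. Field names are assumptions.
    GRAPHQL = "http://localhost:8080/api/graphql"

    query = """
    query lineage($urn: String!) {
      dataset(urn: $urn) {
        lineage(input: {direction: DOWNSTREAM, start: 0, count: 100}) {
          relationships { type entity { urn } }
        }
      }
    }
    """

    source_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table1,PROD)"  # placeholder
    target_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table2,PROD)"  # placeholder

    resp = requests.post(GRAPHQL, json={"query": query, "variables": {"urn": source_urn}})
    resp.raise_for_status()
    rels = resp.json()["data"]["dataset"]["lineage"]["relationships"]
    # Direct (one-hop) relationship check; a multi-hop path needs repeated calls.
    print(any(r["entity"]["urn"] == target_urn for r in rels))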
  • famous-match-44342

    04/13/2022, 1:06 PM
    Unable to run quickstart: - Docker doesn't seem to be running. Did you start it?
  • hallowed-analyst-96384

    04/13/2022, 4:35 PM
    Hi everyone, I'm having the following issue in Airflow:
    ERROR - ('Unable to emit metadata to DataHub GMS', {'message': "HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: /entities?action=ingest (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f16c9a1c970>: Failed to establish a new connection: [Errno 111] Connection refused'))"})
    The Airflow connection is well set (`airflow.connections`):
    - id: datahub_rest_default
      type: datahub_rest
      host: <http://localhost>
      port: 8080
    And the Airflow configuration file gets the right information:
    AIRFLOW__LINEAGE__DATAHUB_KWARGS: '{ "datahub_conn_id": "datahub_rest_default", "capture_ownership_info": true, "capture_tags_info": true }'
    I would like to add that Airflow was deployed in a Kubernetes environment using:
    airflow:
      image:
        repository: acryldata/airflow-datahub
        tag: latest
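    The error shows the emitter targeting localhost:80, so the connection host likely needs to point at the in-cluster GMS service and include the port; a minimal sketch to verify reachability from an Airflow pod, assuming the datahub Python package is installed and using a placeholder service name:
    Copy code
    # Run from an Airflow pod to confirm the GMS address is reachable.
    # "datahub-datahub-gms" is a placeholder for the in-cluster GMS service name.
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://datahub-datahub-gms:8080")
    emitter.test_connection()  # raises if GMS cannot be reached
    print("GMS reachable")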
  • famous-match-44342

    04/14/2022, 4:00 AM
    [guang.wang@bj-bi-s ymlpackage]$ vim mysql.yml
    [guang.wang@bj-bi-s ymlpackage]$ python3 -m datahub ingest -c mysql.yml
    [2022-04-14 03:59:06,714] INFO {datahub.cli.ingest_cli:88} - DataHub CLI version: 0.8.32.5
    [2022-04-14 03:59:07,058] ERROR {datahub.entrypoints:158} -
      File "/usr/local/python3/lib/python3.7/site-packages/datahub/cli/ingest_cli.py", line 95, in run
        --> 95  pipeline = Pipeline.create(pipeline_config, dry_run, preview, preview_workunits)
      File "/usr/local/python3/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 183, in create
        --> 183 config = PipelineConfig.parse_obj(config_dict)
      File "pydantic/main.py", line 511, in pydantic.main.BaseModel.parse_obj
      File "pydantic/main.py", line 329, in pydantic.main.BaseModel.init
      File "pydantic/main.py", line 1022, in pydantic.main.validate_model
      File "pydantic/fields.py", line 847, in pydantic.fields.ModelField.validate
      File "pydantic/fields.py", line 1118, in pydantic.fields.ModelField._apply_validators
      File "pydantic/class_validators.py", line 278, in pydantic.class_validators._generic_validator_cls.lambda2
      File "/usr/local/python3/lib/python3.7/site-packages/datahub/ingestion/run/pipeline.py", line 72, in datahub_api_should_use_rest_sink_as_default
        --> 72  if values["sink"].type is not None:
    KeyError: 'sink'
    [2022-04-14 03:59:07,058] INFO {datahub.entrypoints:162} - DataHub CLI version: 0.8.32.5 at /usr/local/python3/lib/python3.7/site-packages/datahub/__init__.py
    [2022-04-14 03:59:07,058] INFO {datahub.entrypoints:165} - Python version: 3.7.0 (default, Apr 12 2022, 08:55:40) [GCC 4.8.5 20150623 (Red Hat 4.8.5-44)] at /usr/local/bin/python3 on Linux-3.10.0-957.1.3.el7.x86_64-x86_64-with-centos-7.6.1810-Core
    [2022-04-14 03:59:07,058] INFO {datahub.entrypoints:167} - GMS config {}
    [guang.wang@bj-bi-s ymlpackage]$
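    The KeyError: 'sink' typically surfaces when the recipe has no valid top-level sink: block (for example, if it is mis-indented under source:); a minimal sketch of a complete MySQL recipe expressed programmatically, with placeholder connection details:
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mysql",
                "config": {
                    "host_port": "localhost:3306",  # placeholder
                    "username": "datahub",          # placeholder
                    "password": "datahub",          # placeholder
                },
            },
            # The recipe needs this block at the top level, not nested under source.
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.pretty_print_summary()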
  • great-dentist-95905

    04/14/2022, 9:12 AM
    Hi everyone, I am working on real-time schema discovery using DataHub, where there is a source that wants to push metadata to the DataHub application we have deployed. What does the source need to send? Also, are there any utilities to convert to the Pegasus format? We are planning to use Kafka as a data broker in the middle. Any references or links would be really handy.
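    For pushing metadata from a source, the Python SDK emitters build and serialize the aspects, so nothing needs to be hand-written in Pegasus; a hedged sketch emitting a schemaMetadata aspect over REST (a Kafka emitter also exists in datahub.emitter.kafka_emitter if a broker sits in the middle), with platform, dataset, and field names as placeholders:
    Copy code
    from datahub.emitter.mce_builder import make_data_platform_urn, make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    # Describe one discovered column; all names here are placeholders.
    schema = SchemaMetadataClass(
        schemaName="customer",
        platform=make_data_platform_urn("hive"),
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=[
            SchemaFieldClass(
                fieldPath="customer_id",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(50)",
            )
        ],
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=make_dataset_urn("hive", "db.customer", "PROD"),
        aspectName="schemaMetadata",
        aspect=schema,
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)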
  • clean-nightfall-92007

    04/15/2022, 6:56 AM
    Copy code
    2022-04-15 05:05:01.500:INFO::main: Logging initialized @420ms to org.eclipse.jetty.util.log.StdErrLog
    WARNING: jetty-runner is deprecated.
             See Jetty Documentation for startup options
             <https://www.eclipse.org/jetty/documentation/>
    2022-04-15 05:05:01.539:INFO:oejr.Runner:main: Runner
    2022-04-15 05:05:01.736:INFO:oejs.Server:main: jetty-9.4.20.v20190813; built: 2019-08-13T21:28:18.144Z; git: 84700530e645e812b336747464d6fbbf370c9a20; jvm 1.8.0_302-b08
    2022-04-15 05:05:03.302:WARN:oejw.WebAppContext:main: Failed startup of context o.e.j.w.WebAppContext@3339ad8e{/,null,UNAVAILABLE}{file:///datahub/datahub-gms/bin/war.war}
    java.util.zip.ZipException: invalid entry CRC (expected 0xe78b2198 but got 0xde56297a)
            at java.util.zip.ZipInputStream.readEnd(ZipInputStream.java:394)
            at java.util.zip.ZipInputStream.read(ZipInputStream.java:196)
            at java.util.jar.JarInputStream.read(JarInputStream.java:207)
            at org.eclipse.jetty.util.IO.copy(IO.java:172)
            at org.eclipse.jetty.util.IO.copy(IO.java:122)
            at org.eclipse.jetty.util.resource.JarResource.copyTo(JarResource.java:218)
    I get this error when using the official image to build.
  • acceptable-architect-70237

    04/15/2022, 3:57 PM
    Hi team, a question about lineage, or maybe about how the graph DB works for this use case. For example, I have ingested a lineage with the following relationship. Both DB.Table1 and DB.Table2 conform to DataHub's UpstreamLineage aspect definition.
    Copy code
    DB.Table1 -> DB.Table2
    Later on, I ingested another upstream lineage for DB.Table3, again with a correct aspect definition:
    Copy code
    DB.Table2 -> DB.Table3
    My question is: in DataHub, will DB.Table3 automatically show the following derived relationship? Or, when I query any node (DB.Table1, DB.Table2), will it automatically show the upstream and downstream if available?
    Copy code
    DB.Table1 -> DB.Table2 -> DB.Table3
    I would assume it should show as such. Otherwise, we could just use the upstreamLineage and downstreamLineage aspects to render the UI and wouldn't need to use the graph DB at all.
  • fierce-city-89572

    04/15/2022, 4:09 PM
    Hi everyone, I am new to DataHub. I followed the quickstart instructions and set up DataHub. The frontend GUI uses HTTP on port 9002. Is there a way to use HTTPS for the frontend GUI? I cannot find any instructions on doing this in the documentation...
  • fresh-memory-10355

    04/15/2022, 6:16 PM
    Hello, please guide me on how to achieve lineage.
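    A minimal sketch of emitting table-to-table lineage with the Python SDK's REST emitter, with the platform, table names, and GMS address as placeholders:
    Copy code
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    # table2 is declared downstream of table1; URNs and server are placeholders.
    lineage_mce = builder.make_lineage_mce(
        [builder.make_dataset_urn("hive", "db.table1", "PROD")],  # upstreams
        builder.make_dataset_urn("hive", "db.table2", "PROD"),    # downstream
    )

    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mce(lineage_mce)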
  • adorable-receptionist-20059

    04/15/2022, 9:06 PM
    Can DataHub expose quality metrics like (a) last updated, (b) data ranges of the data, and (c) unique count? If so, can someone point me to resources?
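    Column ranges and unique counts generally come from SQL profiling, which is enabled per source in the ingestion recipe; freshness-style "last updated" metrics come from usage/operation metadata rather than profiling. A hedged sketch of enabling profiling on a Postgres source, with connection details as placeholders:
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",
                "config": {
                    "host_port": "localhost:5432",  # placeholder
                    "database": "analytics",        # placeholder
                    "username": "datahub",          # placeholder
                    "password": "datahub",          # placeholder
                    # Column-level stats (min/max, distinct counts, null counts)
                    # appear under the dataset's Stats tab after profiling runs.
                    "profiling": {"enabled": True},
                },
            },
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()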
  • gentle-camera-33498

    04/18/2022, 7:44 PM
    Hello everyone! Can anyone describe the differences between deploying DataHub with and without neo4j? What are the advantages? Another question: what is the best way to deploy the DataHub GMS server (on k8s) to avoid 500 status code errors when running heavy ingestions at the same time?
  • alert-football-80212

    04/19/2022, 8:40 AM
    Hi, is there an API or CLI to validate the tags or terms of a dataset in DataHub?
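    One way to check a dataset's tags and terms programmatically is the GraphQL API; a hedged sketch, assuming the /api/graphql endpoint and field names (which may vary by version), with the URN as a placeholder:
    Copy code
    import requests

    GRAPHQL = "http://localhost:8080/api/graphql"  # add a bearer token header if auth is enabled
    DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:hive,db.table1,PROD)"  # placeholder

    query = """
    query tagsAndTerms($urn: String!) {
      dataset(urn: $urn) {
        tags { tags { tag { urn } } }
        glossaryTerms { terms { term { urn } } }
      }
    }
    """

    resp = requests.post(GRAPHQL, json={"query": query, "variables": {"urn": DATASET_URN}})
    resp.raise_for_status()
    dataset = resp.json()["data"]["dataset"]
    tags = [t["tag"]["urn"] for t in ((dataset.get("tags") or {}).get("tags") or [])]
    terms = [t["term"]["urn"] for t in ((dataset.get("glossaryTerms") or {}).get("terms") or [])]
    print(tags, terms)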
  • quaint-lighter-81058

    04/19/2022, 6:17 PM
    👋 Hi everyone! The Azure Managed Instance connection is failing from the recipe with: raise tds_base.Error('Client does not have encryption enabled but it is required by server, '
  • square-solstice-69079

    04/20/2022, 9:38 AM
    Is it possible to delete a database schema? I deleted all tables in the schema with a --query on the schema name. https://datahubproject.io/docs/how/delete-metadata Are there any parameters possible that are not mentioned in this documentation?
  • mammoth-fountain-32989

    04/20/2022, 1:01 PM
    Hi, are there crawlers available for metadata refresh from sources (PostgreSQL, Hive, HDFS files)? I see the emitters, which can trigger a metadata refresh and can be invoked from CI/CD pipelines or a deployment script (push mechanism). Is there a way DataHub can pull the delta changes using some kind of crawlers/listeners?
    We are evaluating whether DataHub can be used as a data catalog as well as an operational metastore; a description of the latter follows. We have jobs that check for data refresh of upstream datasets and then trigger the downstream loads. Say the hourly data refresh of table_x for the hour 20220420 2300 completed in Hive at 2022-04-20 23:15, and the job (pg_refresh_x) that refreshes the Postgresql dataset from Hive can then run. If this refresh information for Hive table_x can be stored in DataHub (when the last refresh was done, how fresh the data is, the job that refreshed it, job start and end times, etc.), job pg_refresh_x keeps polling for these properties since it last loaded and runs when the latest upstream run is completed.
    And can this history be stored as well, something like this:
    data_as_of    | load_start       | load_end         | data_load_job_name
    --------------|------------------|------------------|-------------------
    20220420 2300 | 2022-04-20 23:02 | 2022-04-20 23:15 | hive_refresh_job
    20220420 2200 | 2022-04-20 22:05 | 2022-04-20 22:11 | hive_refresh_job
    20220420 2100 | 2022-04-20 21:20 | 2022-04-20 21:26 | hive_refresh_job
    20220420 2000 | 2022-04-20 20:04 | 2022-04-20 20:13 | hive_refresh_job
    How do I maintain such a history (say, one month for each dataset) attached to the dataset in DataHub? Should I create a custom aspect? Also, with this use case, as DataHub becomes an operational metastore, does it support high availability, and how do I configure that if supported? Thanks
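    A lightweight option for the latest-refresh part is to upsert the values as datasetProperties.customProperties; keeping a month of per-run history attached to the dataset would more likely call for a custom (timeseries) aspect. A hedged sketch with placeholder values (note that upserting this aspect replaces any existing datasetProperties):
    Copy code
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, DatasetPropertiesClass

    # Placeholder refresh details for the latest run only.
    props = DatasetPropertiesClass(
        customProperties={
            "data_as_of": "20220420 2300",
            "load_start": "2022-04-20 23:02",
            "load_end": "2022-04-20 23:15",
            "data_load_job_name": "hive_refresh_job",
        }
    )

    mcp = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType=ChangeTypeClass.UPSERT,
        entityUrn=make_dataset_urn("hive", "db.table_x", "PROD"),
        aspectName="datasetProperties",
        aspect=props,
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(mcp)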
  • hallowed-analyst-96384

    04/20/2022, 1:26 PM
    Hi everyone, I have a strange problem: after successfully implementing Airflow and DataHub in a K8s environment, I can see on the DataHub dashboard that the number of pipelines is 5 (which means 5 DAGs are running in Airflow and sending metadata), but only three are showing up; the others are not found in the DataHub UI.
  • breezy-noon-83306

    04/20/2022, 3:07 PM
    Hi community! I want feedback from you. I am a big advocate for deploying DataHub at the company I am working for, but they ask me what they need DataHub for. I have explained a lot of things to them, but they still ask what they need it in the company for, i.e. which use cases can be solved with the tool... And I don't know what more to tell them. Please help me: what can I say to them? What would you say? Do you think I should ask the community? Thanks
  • ripe-alarm-85320

    04/20/2022, 10:08 PM
    Just did a small ad hoc demo of DataHub to the team at my company and was pleasantly surprised at all the improvements since the last time I used it. I think it might be a good fit, and we are strongly considering open source adoption in the coming weeks. Great stuff!
  • salmon-area-51650

    04/21/2022, 12:22 PM
    Hi team! 👋 One question: is it possible to use a transformation to add a tag in Kafka Connect pipelines ingestion? Thanks in advance
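    Recipes accept a transformers section, and simple_add_dataset_tags is the stock transformer for adding tag URNs; whether it affects everything the kafka-connect source emits (it only acts on dataset entities) is a separate question. A hedged sketch as a programmatic pipeline, with the Connect URI and tag as placeholders:
    Copy code
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka-connect",
                "config": {"connect_uri": "http://localhost:8083"},  # placeholder
            },
            # simple_add_dataset_tags attaches tag URNs to dataset entities
            # flowing through the pipeline.
            "transformers": [
                {
                    "type": "simple_add_dataset_tags",
                    "config": {"tag_urns": ["urn:li:tag:ConnectIngested"]},
                }
            ],
            "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
        }
    )
    pipeline.run()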
  • wooden-chef-22394

    04/22/2022, 11:33 AM
    https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:snowflake,long_tail_companions.analytics.ShelterDogs,PROD) shows 'Unauthorized'. Help!