# show-and-tell
  • fancy-fireman-15263

    11/23/2021, 7:44 PM
    FYI - managed to work out what proportion of our schemas have descriptions using the following query against the GraphQL API:
    {
      search(input: { type: DATASET, query: "*", start: 0, count: 4000 }) {
        searchResults {
          entity {
            urn
            type
            ... on Dataset {
              name
              schemaMetadata(version: 0) {
                fields {
                  fieldPath
                  description
                }
              }
            }
          }
        }
      }
    }
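    A minimal sketch of turning these search results into a coverage number with Python; the /api/graphql endpoint path and the personal access token are assumptions, so adjust for your deployment:
    import requests

    GRAPHQL_URL = "http://localhost:9002/api/graphql"  # assumed endpoint
    TOKEN = "<personal-access-token>"                  # assumed auth method

    QUERY = """
    {
      search(input: { type: DATASET, query: "*", start: 0, count: 4000 }) {
        searchResults {
          entity {
            urn
            ... on Dataset {
              schemaMetadata(version: 0) {
                fields { fieldPath description }
              }
            }
          }
        }
      }
    }
    """

    resp = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY},
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()

    # Flatten all schema fields and count the ones with a non-empty description.
    fields = [
        field
        for result in resp.json()["data"]["search"]["searchResults"]
        for field in ((result["entity"].get("schemaMetadata") or {}).get("fields") or [])
    ]
    documented = sum(1 for field in fields if field.get("description"))
    print(f"{documented}/{len(fields)} schema fields have a description")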
    👍🏾 1
    teamwork 2
    👏 3
    👍 3
  • stale-guitar-98627

    01/31/2022, 7:48 PM
    Hi friends! I am a co-chair for the SciPy Conference Data Lifecycle track this year. Our goal is to introduce industry best practices to scientific computing. If you use DataHub (or any other data tools), please consider submitting an abstract to present for our track. Always happy to chat if there are any questions 🙂
    🔥 5
    teamwork 5
  • astonishing-lunch-91223

    03/25/2022, 5:56 PM
    https://twitter.com/MihaiTodor/status/1507415722988736514
    teamwork 6
    ❤️ 3
    excited 3
    datahub 2
    datahubbbb 2
  • boundless-fall-91431

    05/18/2022, 3:02 PM
    👋 Join us on Thursday, May 24 in SF and May 26 in Seattle to learn about Distributed SQL and enjoy some 🍺 and 🍕 • https://lu.ma/tds-sea4 • https://lu.ma/tds-sf3
    teamwork 1
  • faint-television-78785

    07/12/2022, 10:53 AM
    anyone here work in the aerospace or defense industries? would be interested to hear about your work
  • aloof-dentist-85908

    07/13/2022, 3:13 PM
    Hi everybody 😊 Is there anyone working with SAP who has integrated metadata from all the different SAP tools (SAP Lumira Designer, SAP Analytics Cloud, SAP BW, SAP HANA, SAP Data Intelligence, etc.)? #sap
    👀 4
  • full-shoe-73099

    07/25/2022, 1:40 PM
    Hi friends! Does anyone know how to display the Power BI report environment in the DataHub interface?
  • mysterious-pager-59554

    07/27/2022, 2:16 PM
    Hello team, has anybody here worked on DataHub's integration with great-expectations, i.e. pushing Great Expectations validation results to DataHub for a CSV/Parquet file? I could only accomplish this for SQL-like data sources (e.g. BigQuery).
  • lively-judge-30357

    02/15/2022, 11:47 AM
    Wasn’t sure where else to share this, but I thought it was roughly related to interests of people in the DataHub community: we’re hosting Wes McKinney, pandas creator & Apache Arrow co-creator, as a speaker next week. Wes will share updates on the recent directions and work being done in Apache Arrow, showcasing examples of working with Arrow in R, SQL and Python, and will discuss the ongoing work on high-performance C++-based query engines for Arrow. More info here: https://events.beamery.com/gresearch/g-research-distinguished-speaker-series:-wes-mckinney-xhxlxljjo (edited)
    🙌 3
  • dazzling-appointment-34954

    09/16/2022, 12:51 PM
    Hey everyone, I don't know if this is the right place to post this, but I created a little helper for some client projects which might also be helpful for other people in the community, so I would like to contribute 🙂 What does it do? It uses a Google Sheets template to create a well-formatted JSON file that can be ingested as individual datasets into your DataHub instance. Through Google Sheets you can easily copy+paste a lot of data into it, e.g. to document all datasets from an individual system you have at your company (we did this a couple of times with KPIs or SAP data, for example). It is a very basic version that supports the main aspects of a dataset + 3 individual schema fields for now (but can easily be extended/adapted). The code should be self-explanatory in the Apps Script. Disclaimer: I am not a software engineer, so there is definitely room for improvement 😉 You can find the script here: https://docs.google.com/spreadsheets/d/1DFCPi2_o8oTXpwVwxDvdhLiaUaGsj5yNmN-NhJnIWxQ/edit?usp=sharing Feel free to try it and give me feedback if anything does not work as expected.
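    A minimal sketch of what the ingestion side of such a JSON file could look like with the DataHub Python REST emitter; the JSON field names used here (platform, name, description) are hypothetical placeholders, not the template's actual format:
    import json

    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # your GMS endpoint

    with open("datasets.json") as f:  # hypothetical output of the sheet template
        entries = json.load(f)

    for entry in entries:
        urn = builder.make_dataset_urn(platform=entry["platform"], name=entry["name"], env="PROD")
        props = DatasetPropertiesClass(name=entry["name"], description=entry.get("description", ""))
        # Each entry becomes one dataset with a datasetProperties aspect.
        emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=props))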
    teamwork 3
    datahubbbb 4
    catyay 1
  • many-rainbow-50695

    10/10/2022, 6:50 AM
    Hi, everyone! I've created an open business glossary from a registry of semantic data types. It includes 300+ existing data types - classification codes, personal and company identifiers, geocodes, etc. - covering the most common types plus dozens of country- and language-specific ones like the UK Ward code or the French SIRET code. I am also working on semantic data type detection integrated with DataHub ingestion: I've already published the metacrafter tool, which automatically identifies semantic data types, and the next step is to make it compatible with the DataHub API and ingestion process. It is available here: https://github.com/apicrafter/metacrafter-registry/blob/main/data/datahub/metacrafter.yml Feel free to contact me and provide your feedback.
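    A hedged sketch of one way detection output could be pushed into DataHub, attaching a detected semantic type to a dataset as a glossary term via the Python emitter (the dataset and term names below are made-up placeholders):
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        GlossaryTermAssociationClass,
        GlossaryTermsClass,
    )

    dataset_urn = builder.make_dataset_urn("postgres", "public.companies")  # placeholder dataset
    term_urn = builder.make_term_urn("uk_ward_code")                        # placeholder term

    terms = GlossaryTermsClass(
        terms=[GlossaryTermAssociationClass(urn=term_urn)],
        auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:metacrafter"),
    )

    DatahubRestEmitter(gms_server="http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms)
    )
    (Per-field terms would go through the editableSchemaMetadata aspect instead.)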
    ❤️ 2
  • crooked-van-51704

    01/03/2023, 4:33 PM
    Hey, at the end of the year we were really sad that the Postgres source didn't have support for table lineage yet and we really needed it, so … we wrote it and want to share it with everyone else. We have open-sourced a new package, datahub-postgres-lineage; it behaves as a new “data source” for now, but it only emits lineage for views in Postgres. The package is already available on PyPI and can be easily installed using
    pip install datahub-postgres-lineage
    For now, we decided to release the package as a standalone data source so that people can try it right away, but we plan to propose including this in the built-in Postgres data source as one more option, similar to Snowflake.
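    A hedged sketch of running it programmatically through DataHub's ingestion Pipeline API; the source type name ("postgres-lineage") and its config keys are assumptions modelled on the regular Postgres source, so check the package's README for the real ones:
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres-lineage",  # assumed registered source name
                "config": {
                    "host_port": "localhost:5432",
                    "database": "mydb",
                    "username": "datahub",
                    "password": "...",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()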
    plus1 8
    🙌 7
    teamwork 9
    thank you 1
  • gentle-lifeguard-88494

    02/24/2023, 8:07 PM
    Hey everyone, I was able to use the custom metadata model to get distinct values for all columns! This is only for low-cardinality columns (Cardinality.FEW) based on the GE definition. It was really nice not having to code any React/TypeScript to get it to show in the UI as well, phew. Thought I would share, since it took me a little bit to orient myself to the open-source project, but I'm proud of this small achievement! If anyone is interested in all the steps involved (it could be helpful for those doing it for the first time), I can share more details as well. Thanks to everyone who helped out! @orange-night-91387 @bulky-soccer-26729 @curved-planet-99787 @astonishing-answer-96712 Looking forward to making bigger contributions in the future 💪 P.S. The data shown is from the public Chinook sample database - so no PII issues to worry about from the screenshot.
    👍 4
    teamwork 8
    🙌 2
  • sparse-address-17104

    03/16/2023, 8:56 PM
    Hi everyone, just contributing to the group: I have put together DataHub documentation for deploying DataHub through a straightforward step-by-step process, using dbt/Trino push/pull data sources. If anyone is interested, it can be read here: https://www.linkedin.com/feed/update/urn:li:activity:7037156471440072704/
    🙏 1
  • sparse-address-17104

    03/16/2023, 8:56 PM
    It's on Medium as well: https://medium.com/@leandro.totino87/centralized-open-data-plataform-jupyterhub-trino-dbt-grafana-minio-hive-datahub-751d320ff8d7
  • hallowed-microphone-6899

    03/22/2023, 1:16 PM
    Hi team, I just followed the Airflow integration guide, but the Airflow log is stuck pending (status in picture 1, Airflow connections in picture 2; the code is in picture 3). Env info: Airflow = 2.5.2 (standalone), acryl_datahub_airflow_plugin = 0.10.0.6, Python 3.9.6. Can you tell me what to do?
  • helpful-librarian-40144

    04/05/2023, 4:18 AM
    DataHub dataset search is very slow - how can I debug the root cause?
  • gentle-continent-23026

    04/06/2023, 5:28 PM
    Hi, I have built a simple DataHub document QA chat app using OpenAI and LangChain, just a simple example for fun. Anyone who is interested can play with it. Local dev guide: https://huggingface.co/spaces/abdvl/datahub_qa_bot/blob/main/README.md
  • busy-eye-72759

    05/14/2023, 7:52 PM
    Hi team! I've been a satisfied consumer of DataHub for about a year now, and today I'd like to show you the open-source extension that I've made to extract column-level lineage for SQL Server. Together with a few lines of the DataHub Python emitter or similar, it is easy to integrate with DataHub's fine-grained lineage, or just to add some simple table-to-view lineage. It maps selects, updates and inserts, as well as adding relations to other procedure executions. It is still a work in progress, but I've found that it maps about 80% of our lineage in an environment of 3k entities 🙂
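    For the emitter part, a hedged sketch of pushing one extracted column-to-column edge as fine-grained lineage (the table and column names are placeholders, not the extension's actual output format):
    import datahub.emitter.mce_builder as builder
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        DatasetLineageTypeClass,
        FineGrainedLineageClass,
        FineGrainedLineageDownstreamTypeClass,
        FineGrainedLineageUpstreamTypeClass,
        UpstreamClass,
        UpstreamLineageClass,
    )

    upstream = builder.make_dataset_urn("mssql", "db.dbo.source_table")    # placeholder
    downstream = builder.make_dataset_urn("mssql", "db.dbo.target_view")   # placeholder

    # One FineGrainedLineage entry per downstream column, listing its upstream columns.
    fine_grained = [
        FineGrainedLineageClass(
            upstreamType=FineGrainedLineageUpstreamTypeClass.FIELD_SET,
            upstreams=[builder.make_schema_field_urn(upstream, "amount")],
            downstreamType=FineGrainedLineageDownstreamTypeClass.FIELD,
            downstreams=[builder.make_schema_field_urn(downstream, "total_amount")],
        )
    ]

    lineage = UpstreamLineageClass(
        upstreams=[UpstreamClass(dataset=upstream, type=DatasetLineageTypeClass.TRANSFORMED)],
        fineGrainedLineages=fine_grained,
    )

    DatahubRestEmitter(gms_server="http://localhost:8080").emit(
        MetadataChangeProposalWrapper(entityUrn=downstream, aspect=lineage)
    )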
    omg 1
    👏 8
  • powerful-monitor-13002

    07/10/2023, 1:27 AM
    Hello everyone, there has been some interest in generating column-level lineage for Spark jobs. Right now only dataset-to-dataset lineage is supported by the DataHub Spark listener. I have recently used the Spline project to generate said lineage from a Spark job and repurposed the output to fit into the FineGrainedLineage construct. Some cool features of Spline: it supports many more low-level Spark commands, along with out-of-the-box support for multiple data providers such as Kafka, Mongo, ES, Hive, JDBC, Cassandra, etc.
  • lemon-yacht-62789

    07/27/2023, 3:41 PM
    Hello all! For anyone interested in a case study within the realm of digital media, I've just published a blog on our use of DataHub (among other things) at Business Insider: https://medium.com/insider-inc-engineering/observable-data-quality-with-elementary-and-datahub-6fa5f92f2c81
    datahubbbb 4
    🔥 8
    👍 2
  • freezing-air-36717

    09/19/2023, 10:43 PM
    I would like to see DataHub in action. Besides the demo on the site, are there any publicly available DataHub instances?
  • limited-library-89060

    09/28/2023, 1:23 AM
    Hi everyone! We just published our first DataHub-related Medium article 🥳. In this article we write about our observability journey and how it can solve some data problems along the way. 🌎 https://medium.com/data-engineering-indonesia/defusing-data-time-bombs-with-datahub-observability-a015cca9a0b6 I hope some of the points resonate with you, thanks 😁
    👀 1
    👏 4
  • shy-ocean-63027

    10/08/2023, 5:13 PM
    https://medium.com/@matt_weingarten/current-2023-takeaways-6d6253b344e6
    👀 1
  • flaky-gpu-393

    11/01/2023, 10:26 PM
    Hi everyone! We just built a product that ingests your application-level code and your metadata to provide you with a more contextual understanding of your data (e.g. it auto-generates catalog descriptions with app context like “when a user submits userDeactivationForm.js from their profile page at website.com/deactivate, user.active_status = `no_longer_active`”). It's free. Let me know if you want to try it out 🙂 I can send you the demo link.
  • mammoth-dinner-7870

    12/05/2023, 5:27 AM
    Q - Anyone here doing data quality on research papers (not the research itself)? I'm working on a synthetic data startup and using models to create various scores for unstructured data. The methods could be applied to data quality at scale with fuzzy, unclean information: https://www.linkedin.com/posts/koconder_generatieveai-reasearch-autonomousagents-activity-7137660229537722368-hQen (demo) - Currently scoring research papers, but also considering internal memos etc. on a graph with tags to drive automatic data governance based on rules (e.g. an incomplete document).
  • full-afternoon-98304

    12/07/2023, 3:49 PM
    Good Morning, is there a place where I can find all the case studies shared with the Community? 🔎👀
    👍🏾 1
  • busy-gigabyte-97279

    12/19/2023, 6:24 PM
    Have you ever struggled to push data catalog adoption/use through your organization (or any other change for that matter)? If so, I have the post for you! It talks about a framework that I have used called ADKAR to frame how to approach getting change through. Comments and thoughts welcome! https://open.substack.com/pub/dagworks/p/winning-hearts-and-minds-at-work
    plus1 1
    ✅ 1
  • shy-ocean-63027

    03/05/2024, 4:49 PM
    Seeing DataHub mentioned! https://towardsdatascience.com/building-a-data-platform-in-2024-d63c736cccef
    👍 1