DataHub #integrate-databricks-datahub

little-megabyte-1074

02/11/2022, 11:08 PM

set the channel description: Central channel to collaborate on Databricks integration

bumpy-furniture-4631

02/14/2022, 3:01 AM

Hey Maggie, thanks for setting this up. Can you help me understand the current status of the integration? Is there anything specific needed from the Databricks team for this? (I can ask folks at Databricks to prioritize it from our end.)

careful-pilot-86309

02/14/2022, 12:11 PM

@bumpy-furniture-4631 Though the setup is a little bit complicated, we are currently able to push lineage from Databricks to DataHub. We would like to try this on a real-time setup and see how usable the current mappings (like pipeline name) are.

little-megabyte-1074

02/14/2022, 3:00 PM

@channel Hello, folks! Hope you all had a wonderful weekend 🙂 Excited to announce that @careful-pilot-86309 & @elegant-doctor-86344 are working on a Databricks <> DataHub integration. We’re eager for folks in the Community to test out the integration & to provide feedback early on! If you’re able to help us out, please let us know!

🙌 1

prehistoric-room-17640

02/14/2022, 3:00 PM

WOOOO HOOO!

prehistoric-room-17640

02/14/2022, 3:02 PM

Absolutely. Let me know how.

quiet-kilobyte-82304

02/14/2022, 3:02 PM

Let me know as well. Is there a writeup on how this might work?

bumpy-furniture-4631

02/14/2022, 3:15 PM

@careful-pilot-86309 @little-megabyte-1074 I can help with the integration testing. Please point me to the docs

loud-island-88694

02/14/2022, 3:24 PM

Hello All - to clarify the initial scope of work, @careful-pilot-86309 has been working on Databricks spark lineage. We will update the documentation soon. Support for deltalake, notebooks etc. will come in the future

👍 2

careful-pilot-86309

02/16/2022, 6:29 AM

Hello All - I appreciate your enthusiastic response. I am attaching usage instructions and the jar. Please try it and let us know your feedback. Please note that this is a basic beta version.

DATABRICKS_README.pdf datahub-spark-lineage-databricks.jar

teamwork 1

prehistoric-room-17640

03/09/2022, 3:18 AM

are there any updates on deltalake support?

plus one 1

lemon-terabyte-66903

03/10/2022, 4:11 PM

Hello, is there support for the `databricks` platform in lineage? I would like to have a custom lineage with s3 datasets and databricks jobs.

careful-pilot-86309

03/10/2022, 6:40 PM

Right now, we support HDFS and a few JDBC sources (Hive, Oracle, MySQL, etc.) on Databricks.

careful-pilot-86309

03/10/2022, 6:42 PM

Below are the usage instructions for datahub-databricks: https://datahubspace.slack.com/files/U02HE6R3F5L/F0339SXFSJF/databricks_readme.pdf https://files.slack.com/files-pri/TUMKD5EGJ-F033NFEFR97/download/datahub-spark-lineage-databricks.jar Let me know if you are trying this. I can help with setup.

datahub-spark-lineage-databricks.jar DATABRICKS_README.pdf

modern-belgium-81337

04/27/2022, 10:32 PM

master  databricks fs --overwrite datahub-spark-lineage*.jar dbfs:/datahub
Usage: databricks fs [OPTIONS] COMMAND [ARGS]...
Try 'databricks fs -h' for help.

Error: No such option: --overwrite Did you mean --version?

Hi, I’m trying to follow the doc here but it seems like the command hasn’t been updated?

careful-pilot-86309

04/28/2022, 3:39 PM

--overwrite is just an option to overwrite the jar if it is already present. If it is not supported in your environment, you can skip it.

teamwork 1

careful-pilot-86309

04/28/2022, 3:43 PM

I created the document with Databricks CLI version 0.16.3.
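
Note: the error above is consistent with `--overwrite` being passed to the `fs` group itself rather than to a subcommand. In the legacy Databricks CLI (the 0.16.x line mentioned above), `--overwrite` is an option of `databricks fs cp`. A minimal sketch of assembling that command is below. The helper names `build_cp_command` and `upload_jar` are hypothetical, and the `dbfs:/datahub` target is taken from the command quoted earlier in this thread.

```python
import os
import posixpath
import subprocess


def build_cp_command(local_jar, dbfs_dir="dbfs:/datahub", overwrite=True):
    """Build the argv for copying one local jar into a DBFS directory."""
    # DBFS paths always use forward slashes, so join them with posixpath.
    dest = posixpath.join(dbfs_dir, os.path.basename(local_jar))
    cmd = ["databricks", "fs", "cp"]
    if overwrite:
        # --overwrite belongs to the `cp` subcommand, not to `fs` itself.
        cmd.append("--overwrite")
    return cmd + [local_jar, dest]


def upload_jar(local_jar):
    """Run the copy. Assumes a configured Databricks CLI is on PATH."""
    subprocess.run(build_cp_command(local_jar), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(" ".join(build_cp_command("datahub-spark-lineage-databricks.jar")))
```

Since `cp` takes a single source and destination, a shell glob such as `datahub-spark-lineage*.jar` that matches more than one file would need one call per file.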

creamy-tent-10151

07/29/2022, 5:33 PM

Hi all, is there a way to change the spark task name? right now it's just picking up my queries and using that as the name

silly-finland-62382

08/26/2022, 9:19 AM

Hey, can someone help me build Spark lineage on Databricks?

bumpy-furniture-4631

09/04/2022, 10:19 PM

Hey Guys, are there any plans to ingest Data Lineage from Databricks Unity Catalog? The feature is currently in Private Preview. And they have APIs to export the lineage info btw.

loud-island-88694

09/05/2022, 8:15 PM

@bumpy-furniture-4631 this is on our roadmap. Contributions are welcome if you have the bandwidth

❤️ 1

careful-action-61962

09/30/2022, 9:31 AM

You can do this. Create a cluster in Single User mode so that it connects to Unity Catalog. Create a personal access token for that user and configure it in DataHub using the Hive connector.

spark.databricks.sql.initial.catalog.name <unity catalog name>

Add this to your Spark cluster config and you're good to go. Please make sure your user has SELECT permission on the tables. If not, run this:

# Run in a Databricks notebook, where `spark` is already defined.
# Grants the `datahub` user USAGE on each database and SELECT on each table.
catalogs = spark.sql("show catalogs")
for catalog in catalogs.toPandas()['catalog']:
  # Skip catalogs that should not receive grants.
  if catalog in ['default', 'samples']:
    continue
  print(catalog)
  use_catalog = f"USE CATALOG {catalog}"
  print(use_catalog)
  spark.sql(use_catalog)
  show_db = "SHOW DATABASES"
  print(show_db)
  dbs = spark.sql(show_db)
  for db in dbs.toPandas()['databaseName']:
    # USAGE is granted on every database, including the ones skipped below.
    spark.sql(f"grant usage on database {db} to `datahub`")
    # Skip table-level grants for scratch databases.
    if db in ['temp_notebooks', 'temp']:
      continue
    show_table = f"SHOW TABLES IN {db}"
    tables = spark.sql(show_table)
    for idx, row in tables.toPandas().iterrows():
      # Build the qualified name as <database>.<table>.
      table = row['database'] + "." + row['tableName']
      grant_query = f'grant select on table {table} to `datahub`'
      print(grant_query)
      spark.sql(grant_query)
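
Note: the Hive connector setup described in this message might look roughly like the recipe below. This is a sketch only, assuming the `hive` source with the `databricks+pyhive` scheme and a `datahub-rest` sink. The helper name `build_hive_recipe`, the host, token, HTTP path, and server URL are all placeholders, not values from this thread.

```python
import json


def build_hive_recipe(workspace_host, token, http_path,
                      gms_url="http://localhost:8080"):
    """Return a DataHub ingestion recipe as a plain dict (hypothetical helper)."""
    return {
        "source": {
            "type": "hive",
            "config": {
                # Databricks workspaces are reached over HTTPS on port 443.
                "host_port": f"{workspace_host}:443",
                # With a personal access token, the username is the literal "token".
                "username": "token",
                "password": token,
                "scheme": "databricks+pyhive",
                # The cluster's HTTP path, taken from its JDBC/ODBC settings.
                "options": {"connect_args": {"http_path": http_path}},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": gms_url}},
    }


if __name__ == "__main__":
    recipe = build_hive_recipe(
        workspace_host="<workspace-host>",    # placeholder
        token="<personal-access-token>",      # placeholder
        http_path="<cluster-http-path>",      # placeholder
    )
    print(json.dumps(recipe, indent=2))
```

The same structure can be written as a YAML recipe file and run with the `datahub ingest` command.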

numerous-yak-58823

10/03/2022, 2:38 PM

Hello, I was reading the documentation: https://datahubproject.io/docs/metadata-integration/java/spark-lineage/ And it says: Note that testing for other environments such as Databricks is planned in near future. Do you happen to know when Databricks will be officially supported?

hallowed-shampoo-52722

02/09/2023, 6:02 PM

Hi Team, we have integrated Databricks with DataHub. I have created a recipe using databricks+pyhive, and the Spark agent is installed, but I don’t see pipeline tasks. Any idea why that’s happening?

hallowed-shampoo-52722

02/13/2023, 9:27 PM

Hi Guys, Could you please help me with the lineage here!!

fierce-animal-98957

04/25/2023, 5:47 AM

Hi, we are validating data using Great Expectations inside a Databricks notebook, and are now trying to integrate DataHub in the same notebook. Is it even possible to connect to DataHub from within Great Expectations, with everything running inside a single Databricks notebook?

fierce-animal-98957

05/02/2023, 4:25 PM

Hi Team, we are using “DataHubValidationAction” to send assertion metadata to DataHub. We are running this from inside Databricks, which uses the Spark engine. According to the documentation, this currently works only with “SqlAlchemyExecutionEngine”. Does anyone know when this class will be enhanced to add Spark engine support? Is anything on the roadmap? https://datahubproject.io/docs/metadata-ingestion/integration_docs/great-expectations/#capabilities https://docs.greatexpectations.io/docs/integrations/integration_datahub/

gentle-arm-6777

06/29/2023, 3:58 PM

Hi Guys! I set up Databricks Spark lineage, but I don't see any pipeline in DataHub after executing a notebook on Databricks. The logs show that the listener is registered, but there are no DataHub emitter messages in the logs. Any help?

bulky-shoe-65107

10/16/2023, 12:38 AM

has renamed the channel from "integration-databricks-datahub" to "integrate-databricks-datahub"

few-piano-98292

03/06/2024, 4:28 PM

Hello, would appreciate any feedback/help to move forward!