# contribute-code
d
Hey team šŸ‘‹šŸ» I'm investigating possible ways to support a use-case of us on Datahub. We have some delta lake formatted tables which lies inside several s3 buckets, and they are not grouped in some root prefixes. These tables are present in hive metastore but hive cannot read the column structure of them. We want to ingest the metadata of these tables properly but: • Hive ingestion is reading them already and cannot read the schemas of these tables. • Delta lake ingestion does not fit to our use case since we don't have some s3 prefixes that delta tables can be in. We don't want to traverse all s3 buckets over and over again. I'm wondering if I can solve the issue by adding a new parameter to the hive ingestion like "ingest_delta_tables_if_you_encounter_one", and check if hive ingestion finds a table with property
spark.sql.sources.provider
=
delta
. If such a table is found, we can fall back to existing delta table ingestion's method to get the metadata of the table instead of depending on the metadata on hive server. How does that sound? Would you be happy with such an addition? I didn't want to blindly raise a pr without taking your thoughts.
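For illustration only, a minimal sketch of what that fallback could look like, assuming the check runs per table inside the Hive source; `get_table_properties`, `table.storage_location`, and `table.columns` are hypothetical stand-ins, not actual DataHub Hive source internals:
```python
from deltalake import DeltaTable  # library used here to read schemas straight from _delta_log


def extract_schema(table, ingest_delta_tables_if_you_encounter_one: bool = False):
    """Hypothetical hook in the Hive ingestion loop; `table` stands in for
    whatever per-table object the real Hive source iterates over."""
    properties = get_table_properties(table)  # hypothetical helper reading HMS table parameters
    if (
        ingest_delta_tables_if_you_encounter_one
        and properties.get("spark.sql.sources.provider", "").lower() == "delta"
    ):
        # Fall back to reading the schema from the table's _delta_log instead of
        # trusting the (incomplete) schema stored in the Hive metastore.
        delta_table = DeltaTable(table.storage_location)  # e.g. "s3://bucket/path/to/table"
        return delta_table.schema()
    # Default path: use whatever the Hive metastore reports.
    return table.columns
```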
h
> These tables are present in hive metastore but hive cannot read the column structure of them
I would like to understand this a bit more, as this is the primary reason behind your contribution. Do you happen to know if this is a known limitation / bug reported anywhere?
d
Yes, this is a known limitation of Delta Lake. See their doc: https://docs.delta.io/latest/hive-integration.html It directs you to a connector you can use with Hive, but that connector is very limited: https://github.com/delta-io/connectors/tree/master/hive#frequently-asked-questions-faq As a TL;DR: your table definitions can live (partially) inside HMS, but Hive will not be able to query the table metadata or read the actual data. BTW, I'm not an expert in the Delta format. If what I'm doing does not make sense, please correct me and raise any concerns.
h
So when creating the table in Hive, we need to specify the schema manually, as mentioned here. Given that, the schema should ideally be extractable when using the Hive connector. You are using this Hive connector in DataHub, right? What you are trying to do makes sense in general; however, I am trying to understand what additional information DataHub will be able to extract because of it.
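For reference, a minimal sketch of that manual schema definition, assuming the delta-io Hive connector jars are on the Hive classpath and a HiveServer2 endpoint is reachable via PyHive; the host, bucket, and column list are placeholders:
```python
from pyhive import hive  # assumes HiveServer2 access; not part of the DataHub ingestion itself

# Per the Delta Hive connector docs, the column list must be typed out by hand;
# it is NOT derived from _delta_log.
ddl = """
CREATE EXTERNAL TABLE delta_example (col1 INT, col2 STRING)
STORED BY 'io.delta.hive.DeltaStorageHandler'
LOCATION 's3a://some-bucket/path/to/delta-table'
"""

conn = hive.connect(host="hive-server.example.com", port=10000)  # hypothetical host
cursor = conn.cursor()
cursor.execute(ddl)
cursor.close()
conn.close()
```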
d
@hundreds-photographer-13496 I couldn't get a definitive answer from our data platform team internally yet, but I know that using this connector was discussed and we decided it would not solve the problem in a nice way. As far as I know, you need to create a duplicate table in Hive, in addition to the original table created from Spark using the Delta format, and then you can query the underlying Delta data from Hive through that duplicate table. This works because both tables point to the same data location. The issue is that you end up with two tables for the same data, and since all systems use the one created by Spark (the original one), lineage is only observed for Spark's version. That table still has incomplete metadata, as the connector does not change the behaviour there. Users of DataHub would still see incomplete metadata for Delta Lake's Spark-sourced tables and would need to manually check the Hive duplicate to see the table structure. Hope this makes sense.
h
I haven't understood the setup completely yet, primarily how the Delta Lake table created in Spark is present in (/linked to) Hive if you have not created it in Hive using CREATE EXTERNAL TABLE. However, from what I understand, you are facing the problem that two tables will point to the same data; you can make use of the siblings feature. See here. At the same time, it might be okay to fetch and emit the info by pulling it from the underlying Delta table through the Hive source... what is the end goal of this PR? Are you planning to emit the info retrieved from the underlying Delta table with the Hive entity URN corresponding to the Delta table?
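If the siblings route is taken, a hedged sketch of what emitting the Siblings aspect from Python could look like, assuming a GMS at localhost:8080; the dataset names, the `delta-lake` platform id, and which side is marked primary are placeholders, not a prescribed setup:
```python
import datahub.emitter.mce_builder as builder
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import SiblingsClass

# Placeholder URNs for the two entities that represent the same physical table.
hive_urn = builder.make_dataset_urn(platform="hive", name="db.delta_example", env="PROD")
delta_urn = builder.make_dataset_urn(
    platform="delta-lake", name="some-bucket/path/to/delta-table", env="PROD"
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Link the two datasets; here the Spark/Delta dataset is marked as primary.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=hive_urn,
        aspect=SiblingsClass(primary=False, siblings=[delta_urn]),
    )
)
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=delta_urn,
        aspect=SiblingsClass(primary=True, siblings=[hive_urn]),
    )
)
```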
b
Hey @hundreds-photographer-13496, you can create a table from Spark using the Hive catalog. For example:
```python
df.write.format("delta").mode("overwrite").saveAsTable("xxx")
```
Actually, you can create any type of table from Spark without running a CREATE TABLE statement in Hive.
Also, delta tables don't use the catalog system to derive the table structure; table structures are written in `_delta_log`, so there is no dependency between Hive and Spark when it comes to creating external tables.
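To make that concrete, a small example of reading schema and metadata directly from `_delta_log` with the `deltalake` Python package (which, as far as I know, is also what the standalone delta-lake source builds on); the S3 path is a placeholder and credentials are assumed to come from the environment:
```python
from deltalake import DeltaTable

# Point straight at the table's root directory; schema and table metadata are
# read from the files under _delta_log, with no Hive metastore involved.
dt = DeltaTable("s3://some-bucket/path/to/delta-table")

print(dt.version())   # current table version
print(dt.metadata())  # table-level metadata (name, partition columns, ...)
print(dt.schema())    # column names and types, straight from the Delta log
```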
Actually, https://datahubproject.io/docs/generated/ingestion/sources/delta-lake/ does not use the Hive metastore. If there were such a restriction, I imagine this adaptor wouldn't be required.