# getting-started
melodic-match-31187:
Hi everyone! I am new to DataHub! We have a use case to solve for which we are looking into DataHub. Our raw data comes from Kafka to GCS in Hudi format (raw layer). Next we want to create a derived layer of tables after a proper schema validation check. We will be using Spark to move/process data from the raw layer to the derived layer. What we want to achieve via DataHub:
• Store the schema of tables in DataHub
• Get the schemas in Spark from DataHub
• Apply the schema to the data
PS: We don't only want to use DataHub as a Schema Registry, but this is one of the use cases we want to solve.
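For the "get the schemas in Spark from DataHub" step, here is a minimal sketch assuming DataHub's GraphQL API is used to read the dataset's schemaMetadata and map its fields onto a Spark StructType. The endpoint URL, dataset URN, and the string-only type mapping are placeholders for illustration, not a definitive implementation:
```python
import requests
from pyspark.sql.types import StructField, StructType, StringType

# Placeholder endpoint and dataset URN -- adjust to your DataHub deployment / dataset.
DATAHUB_GRAPHQL = "http://localhost:8080/api/graphql"
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:gcs,raw.events_e1,PROD)"

# GraphQL query for the dataset's schemaMetadata (field paths + native types).
SCHEMA_QUERY = """
query getSchema($urn: String!) {
  dataset(urn: $urn) {
    schemaMetadata {
      fields {
        fieldPath
        nativeDataType
      }
    }
  }
}
"""

def fetch_schema_fields(urn):
    """Fetch the schema fields that were registered in DataHub for this dataset."""
    resp = requests.post(
        DATAHUB_GRAPHQL,
        json={"query": SCHEMA_QUERY, "variables": {"urn": urn}},
        # headers={"Authorization": "Bearer <token>"},  # if metadata-service auth is enabled
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["dataset"]["schemaMetadata"]["fields"]

def to_spark_schema(fields):
    """Map DataHub schema fields to a Spark StructType. Everything is string-typed
    here for brevity; real code would translate nativeDataType to the matching Spark type."""
    return StructType(
        [StructField(f["fieldPath"], StringType(), True) for f in fields]
    )

# expected_schema = to_spark_schema(fetch_schema_fields(DATASET_URN))
```
Going through GraphQL keeps the Spark job decoupled from any particular SDK; the acryl-datahub Python client should also be able to read the same aspect.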
mammoth-bear-12532:
This is a pretty cool use case! Does this allow you to skip registering the schema in another metastore (like Glue or Hive)?
melodic-match-31187:
No, in addition to this we would also register it in the Hive metastore.
mammoth-bear-12532:
In this case, what is the advantage of getting the schema from DataHub versus getting it from the Hive metastore?
melodic-match-31187:
@mammoth-bear-12532
```json
[
  {
    "event": "E1",
    "properties": "{some fields}"
  },
  {
    "event": "E2",
    "properties": "{some fields}"
  }
]
```
Our raw data is an array of JSON objects. While creating the derived tables, we will segregate each event into a separate table and flatten the complex `properties`, thereby enforcing a schema for each event type/table. If schema validation fails, we will move the data to an error/quarantine location. I don't see such capabilities with the Hive Metastore. Correct me if I am wrong.
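A rough PySpark sketch of that split / flatten / quarantine flow (the GCS paths, the E1 table, and the field names inside `properties` are invented for illustration; in practice the expected schema would be built from what is stored in DataHub, as in the earlier sketch):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("raw-to-derived").getOrCreate()

# Placeholder GCS locations for the three layers.
RAW_PATH = "gs://my-bucket/raw/events/"
DERIVED_PATH = "gs://my-bucket/derived/"
QUARANTINE_PATH = "gs://my-bucket/quarantine/"

# Expected schema for event E1's `properties` payload -- the field names here are
# invented; in practice this StructType would be built from the fields fetched out of DataHub.
e1_properties_schema = StructType([
    StructField("user_id", LongType(), True),
    StructField("page", StringType(), True),
])

# multiLine=true lets Spark read files whose top-level element is a JSON array.
raw_df = spark.read.option("multiLine", "true").json(RAW_PATH)

# Segregate one event type into its own derived table.
e1_df = raw_df.filter(col("event") == "E1")

# `properties` arrives as a JSON string (per the sample above); from_json yields
# NULL for payloads it cannot parse, which is what routes rows to quarantine here.
parsed = e1_df.withColumn("props", from_json(col("properties"), e1_properties_schema))

valid = parsed.filter(col("props").isNotNull()).select("event", "props.*")
invalid = parsed.filter(col("props").isNull()).drop("props")

valid.write.mode("append").parquet(DERIVED_PATH + "e1/")    # or Hudi, per your derived-layer format
invalid.write.mode("append").json(QUARANTINE_PATH + "e1/")  # error/quarantine location
```
Note that from_json only NULLs out rows it cannot parse at all; stricter checks (e.g., rejecting rows that are missing required fields) would need explicit column comparisons against the schema fetched from DataHub.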
@mammoth-bear-12532 Thoughts?
mammoth-bear-12532:
Hi @melodic-match-31187, sorry for missing this message. Yes, this does make sense at a high level... So if I understand correctly, you will fetch the "expected schema" for event E1 from DataHub, then ensure that it is met, and then add it to the existing set of events for that split dataset?