# getting-started
melodic-match-31187:
Hi everyone! I am new to DataHub! We have a use case to solve for which we are looking into DataHub. Our raw data comes from Kafka to GCS in Hudi format (raw layer). Next we want to create a derived layer of tables after a proper schema validation check. We will be using Spark to move/process data from the raw layer to the derived layer. What we want to achieve via DataHub:
• Store the schema of tables in DataHub
• Get the schemas in Spark from DataHub
• Apply the schema to the data
PS: We don't only want to use DataHub as a Schema Registry, but this is one of the use cases we want to solve.
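For the "get the schemas in Spark from DataHub" step, here is a minimal sketch assuming DataHub's GraphQL API is used to read the dataset's schemaMetadata and map its fields onto a Spark StructType. The endpoint URL, dataset URN, and the string-only type mapping are placeholders for illustration, not a definitive implementation:
```python
import requests
from pyspark.sql.types import StructField, StructType, StringType

# Placeholder endpoint and dataset URN -- adjust to your DataHub deployment / dataset.
DATAHUB_GRAPHQL = "http://localhost:8080/api/graphql"
DATASET_URN = "urn:li:dataset:(urn:li:dataPlatform:gcs,raw.events_e1,PROD)"

# GraphQL query for the dataset's schemaMetadata (field paths + native types).
SCHEMA_QUERY = """
query getSchema($urn: String!) {
  dataset(urn: $urn) {
    schemaMetadata {
      fields {
        fieldPath
        nativeDataType
      }
    }
  }
}
"""

def fetch_schema_fields(urn):
    """Fetch the schema fields that were registered in DataHub for this dataset."""
    resp = requests.post(
        DATAHUB_GRAPHQL,
        json={"query": SCHEMA_QUERY, "variables": {"urn": urn}},
        # headers={"Authorization": "Bearer <token>"},  # if metadata-service auth is enabled
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["dataset"]["schemaMetadata"]["fields"]

def to_spark_schema(fields):
    """Map DataHub schema fields to a Spark StructType. Everything is string-typed
    here for brevity; real code would translate nativeDataType to the matching Spark type."""
    return StructType(
        [StructField(f["fieldPath"], StringType(), True) for f in fields]
    )

# expected_schema = to_spark_schema(fetch_schema_fields(DATASET_URN))
```
Going through GraphQL keeps the Spark job decoupled from any particular SDK; the acryl-datahub Python client should also be able to read the same aspect.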
mammoth-bear-12532:
This is a pretty cool use case! Does this allow you to skip registering the schema in another metastore (like Glue or Hive)?
melodic-match-31187:
No, in addition to this we would also register it in the Hive metastore.
mammoth-bear-12532:
In this case, what is the advantage of getting the schema from DataHub versus getting it from the Hive metastore?
melodic-match-31187:
@mammoth-bear-12532
```json
[
  {
    "event": "E1",
    "properties": "{some fields}"
  },
  {
    "event": "E2",
    "properties": "{some fields}"
  }
]
```
Our raw data is an array of JSON objects. While creating the derived tables, we will segregate each event into a separate table and flatten the complex `properties`, thereby enforcing a schema for each event type/table. If schema validation fails, we will move the data to an error/quarantine location. I don't see such capabilities with the Hive Metastore. Correct me if I am wrong.
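A rough PySpark sketch of that split / flatten / quarantine flow (the GCS paths, the E1 table, and the field names inside `properties` are invented for illustration; in practice the expected schema would be built from what is stored in DataHub, as in the earlier sketch):
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("raw-to-derived").getOrCreate()

# Placeholder GCS locations for the three layers.
RAW_PATH = "gs://my-bucket/raw/events/"
DERIVED_PATH = "gs://my-bucket/derived/"
QUARANTINE_PATH = "gs://my-bucket/quarantine/"

# Expected schema for event E1's `properties` payload -- the field names here are
# invented; in practice this StructType would be built from the fields fetched out of DataHub.
e1_properties_schema = StructType([
    StructField("user_id", LongType(), True),
    StructField("page", StringType(), True),
])

# multiLine=true lets Spark read files whose top-level element is a JSON array.
raw_df = spark.read.option("multiLine", "true").json(RAW_PATH)

# Segregate one event type into its own derived table.
e1_df = raw_df.filter(col("event") == "E1")

# `properties` arrives as a JSON string (per the sample above); from_json yields
# NULL for payloads it cannot parse, which is what routes rows to quarantine here.
parsed = e1_df.withColumn("props", from_json(col("properties"), e1_properties_schema))

valid = parsed.filter(col("props").isNotNull()).select("event", "props.*")
invalid = parsed.filter(col("props").isNull()).drop("props")

valid.write.mode("append").parquet(DERIVED_PATH + "e1/")    # or Hudi, per your derived-layer format
invalid.write.mode("append").json(QUARANTINE_PATH + "e1/")  # error/quarantine location
```
Note that from_json only NULLs out rows it cannot parse at all; stricter checks (e.g., rejecting rows that are missing required fields) would need explicit column comparisons against the schema fetched from DataHub.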
@mammoth-bear-12532 Thoughts?
mammoth-bear-12532:
Hi @melodic-match-31187, sorry for missing this message. Yes, this does make sense at a high level... So if I understand correctly, you will fetch the "expected schema" for event E1 from DataHub, then ensure that it is met, and then add it to the existing set of events for that split dataset?