# ingestion
r
hi, I am new to Datahub and plan on using it. But I had a few questions: 1. How is the metadata brought into Datahub? I see there are ingestion scripts, but is there a way for each of the data sources to push the metadata to Kafka topics (push-based architecture) instead of periodically calling the ingestion scripts? 2. Are owners added manually, or should there be an "owners" field in the json metadata? 3. How are table descriptions and column descriptions added? Are they manually created through the UI? Or should there be a "description" field in the json metadata for both the tables and the columns?
b
Hi Kalyan! Welcome to the community. 1. We do support "push" of metadata for particular systems (e.g. Airflow), but most folks prefer to manage periodic pull+push jobs to batch-ingest metadata. 2. I do not believe owners are auto-populated by the ingestion framework @gray-shoe-75895 to confirm. This means you'd either have to write a transformer to enrich the metadata to include the owner, or add it via the UI. 3. Table descriptions and column descriptions can technically be written using either the ingestion framework or the UI. Typically they are provided via the UI, unless the data platform from which we are ingesting is capable of providing that rich metadata.
m
@red-journalist-15118: DataHub excels at push-based integration. If you can push over HTTP or Kafka, you can send metadata from anywhere to DataHub.
We have convenience methods in Python listed here: https://datahubproject.io/docs/metadata-ingestion/#using-as-a-library
but you can always emit metadata from your favorite language to DataHub, as long as you can write Avro to Kafka.
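To make the push model concrete, here is a stdlib-only sketch of the kind of payload an emitter might send. The dataset name, description, and exact payload shape are simplified assumptions; the Python convenience library linked above (or an Avro producer for Kafka) handles the real serialization:

```python
import json

def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    # DataHub identifies datasets by URNs of this shape.
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"

# Simplified stand-in for a metadata change proposal pushed over HTTP or Kafka.
proposal = {
    "entityType": "dataset",
    "entityUrn": make_dataset_urn("hive", "mydb.orders"),
    "aspectName": "datasetProperties",
    "aspect": {"description": "Orders fact table, refreshed daily."},
}
print(json.dumps(proposal, indent=2))
```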
@red-journalist-15118: for your question on ownership and schema / field description metadata: you can send it via the push API as well as edit it via the UI (we store pushed and UI-edited metadata separately so the two don't overwrite each other)
b
Correct. You can send it, but our default ingestion adapters do not populate it for you
r
@big-carpet-38439 if the default ingestion does not populate it, how can I populate it once I push/pull it from the source (e.g. Hive)?
b
@red-journalist-15118 Today, you have a few options: 1. Write a Transformer that is capable of resolving Ownership given a Dataset record 2. Write a custom Python flow to find the ownership for your datasets and push it to DataHub using the Emitter APIs 3. Add Ownership information inside of DataHub UI manually
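A hedged sketch of option 1 above, assuming ownership can be resolved from a team-maintained lookup. The mapping, key names, and lookup source are illustrative; this is the shape of the logic a transformer would run, not a real DataHub transformer class:

```python
# Illustrative owner lookup; in practice this could read Hive table
# properties, LDAP, or an internal service (all assumptions here).
OWNER_MAP = {"mydb.orders": ["alice", "bob"]}

def enrich_with_ownership(record: dict) -> dict:
    """Attach an ownership aspect to a dataset record if owners are known."""
    owners = OWNER_MAP.get(record.get("name", ""), [])
    if owners:
        record["ownership"] = {
            "owners": [
                {"owner": f"urn:li:corpuser:{o}", "type": "DATAOWNER"}
                for o in owners
            ]
        }
    return record
```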
Where is your ownership information located?
r
@big-carpet-38439 we store owner info and table description and column description as a field in the json metadata
b
Is it a hive table property?
r
yeah
m
@red-journalist-15118: is this a custom format that you have at your company, or a standard format that many deployments use?
trying to figure out if we just need to provide pluggability here.. or an out-of-the-box solution for this
b
+1^
r
I am actually not sure about this. I can ask my team this week and follow up! I really appreciate all your guys help! You guys are amazing!
b
Have no fear! We'll figure something out!
Once you have that talk we can schedule some time to try to figure out something that could work for you but also be made general 🙂
c
Hi @big-carpet-38439 i'm interested in your point number 3 above, as i'm having the same case as Kalyan. My question is: how (technically) can I ingest field descriptions via the ingestion framework? Is it possible to write them in the recipe file so we can just use the `datahub ingest -c` command?
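For reference, a Hive recipe passed to `datahub ingest -c <recipe>` looks roughly like this (host, database, and server values are placeholders; note this sketch does not itself add field descriptions, which is the open question here):

```yaml
source:
  type: hive
  config:
    host_port: localhost:10000
    database: mydb
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```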
m
@chilly-spring-43918: field descriptions are automatically ingested AFAIK. Which source are you connecting it to?
c
i'm trying to ingest from the Hive source, let me give you the screenshot
Here is my table structure in Hive, but when i tried to export it to a file, the description was null.
m
Thanks @chilly-spring-43918, maybe the sqlalchemy driver (pyhive) that we're using does not pull these fields. We'll take a look.
@gray-shoe-75895: ^^
c
thank you @mammoth-bear-12532
Hi @gray-shoe-75895 @mammoth-bear-12532, apologies for following this up, but have you guys had a chance to look into this matter?
m
@chilly-spring-43918: we are looking into the `pyhive` implementation. Will get back to you in a day or two.
c
@mammoth-bear-12532 Thank you very much
p
have the same issue with ingesting descriptions from Hive. Looking forward to a possible solution. Thank you guys
m
@prehistoric-doctor-36763 thanks for letting us know. Watch this thread, will get back soon.
g
Hi @chilly-spring-43918 and @prehistoric-doctor-36763, this should be fixed now in acryl-datahub version 0.3.0 - let me know if you run into any issues with it!
m
@red-journalist-15118: FYI