Hey guys :waving-from-afar-left: I was trying out...
# ingestion
g
Hey guys :waving-from-afar-left: I was trying out DataHub and it is really cool and easy to set up. I have a requirement, but I'm not sure whether DataHub lets me do it easily. I need to be able to ingest a customised `.json` or `.csv` with information about some dashboards that we use. We would like to extend DataHub's data discovery capabilities beyond automatic discovery (awesome experience so far) to include manually entered metadata. That way we can easily add and tweak metadata for our most used reports, spread across all platforms (metadata like `title`, `description` (*with a link to the tool*), `tags`, etc.). I'm aware of the source type `file`, but it seems too verbose, since it reads "from a previously generated file". Is it easy to develop a `.json` with the correct syntax to feed DataHub? I also noticed that `demo_data.json` is generated from a `.csv` (directives) with the help of the `enrich.py` script (source). Is it easy to tweak it to choose whether an entity should fall under Dashboards instead of Datasets? Or even make it a feature? 😊 Thanks in advance 🙂
I also noticed that test_serde_chart_snapshot.json has the syntax for custom Chart data.
g
Yes, you’ve got two options here:
1. Manually create the correctly formatted JSON files and then ingest them using the file source.
2. Use the Python model classes + REST emitter to construct and send those results from Python.
🙌 1
🙏 1
The enrich.py script is a pretty good starting point and should be relatively easy to tweak. I’d recommend using an IDE and a Python type checker (e.g. mypy), since that makes the development experience much easier.
🙏 1
🙌 1
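For option 1, a minimal sketch of turning a hand-maintained CSV into JSON for the file source might look like the following. Everything here is an assumption to check against the repo's example files (such as test_serde_chart_snapshot.json) and the PDL models: the CSV columns, the `row_to_mce` and `csv_to_mces` helpers, the actor URN, and the exact envelope keys (`proposedSnapshot`, `DashboardSnapshot`, `DashboardInfo`, `GlobalTags`). The example files also spell out optional fields as `null`, which this sketch mostly omits.

```python
import csv
import io
import json
import time

# Hypothetical CSV layout: one dashboard per row, tags separated by ';'.
SAMPLE_CSV = """tool,dashboard_id,title,description,url,tags
looker,sales_overview,Sales Overview,Weekly sales KPIs,https://looker.example.com/dashboards/1,sales;weekly
"""


def audit_stamp(actor="urn:li:corpuser:datahub"):
    # Assumed shape of an audit stamp: epoch milliseconds plus an actor URN.
    return {"time": int(time.time() * 1000), "actor": actor}


def row_to_mce(row):
    """Build one event dict for a dashboard (assumed snapshot-style envelope)."""
    urn = f"urn:li:dashboard:({row['tool']},{row['dashboard_id']})"
    stamp = audit_stamp()
    info = {
        "title": row["title"],
        "description": row["description"],
        "charts": [],
        "lastModified": {"created": stamp, "lastModified": stamp},
        "dashboardUrl": row["url"],
    }
    tags = {
        "tags": [{"tag": f"urn:li:tag:{t}"} for t in row["tags"].split(";") if t]
    }
    return {
        "auditHeader": None,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DashboardSnapshot": {
                "urn": urn,
                "aspects": [
                    {"com.linkedin.pegasus2avro.dashboard.DashboardInfo": info},
                    {"com.linkedin.pegasus2avro.common.GlobalTags": tags},
                ],
            }
        },
        "proposedDelta": None,
    }


def csv_to_mces(csv_text):
    # The file source is assumed to read a JSON array of such events.
    return [row_to_mce(r) for r in csv.DictReader(io.StringIO(csv_text))]


if __name__ == "__main__":
    print(json.dumps(csv_to_mces(SAMPLE_CSV), indent=2))
```

Writing the printed array to a file and pointing a `file` source recipe at it would be the next step; if ingestion rejects a record, comparing it field by field with one of the repo's example JSON files is probably the quickest way to find the mismatch.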
g
Thank you @gray-shoe-75895. Is there some documentation regarding the JSON structure?
I would like to explore and get to know not only how "datasets" JSON files are built, but also "charts", "dashboards" and so on. In particular, how can we add subfolders and give them a custom structure?
g
There’s not too much documentation about the JSON structure unfortunately. This is why I’d recommend using the Python APIs if possible. However, the structure of the JSON is defined in PDL files, e.g. the chart info aspect can be found here: https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/chart/ChartInfo.pdl
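As a rough illustration of the Python API route, a dashboard could be constructed with the generated model classes and sent through the REST emitter along these lines. This is a sketch under assumptions, not a confirmed recipe: `dashboard_urn` and `emit_dashboard` are hypothetical helpers, the actor URN and server address are placeholders, and the class names should be checked against the installed `datahub` package version. The imports are deliberately placed inside the function so the helper module loads even where the package is not installed.

```python
import time


def dashboard_urn(tool: str, dashboard_id: str) -> str:
    # Dashboard URNs are assumed to follow urn:li:dashboard:(<tool>,<id>).
    return f"urn:li:dashboard:({tool},{dashboard_id})"


def emit_dashboard(
    gms_server: str,
    tool: str,
    dashboard_id: str,
    title: str,
    description: str,
    url: str,
) -> None:
    """Build a dashboard snapshot and send it to the metadata service."""
    # Imported lazily; requires the datahub Python package at call time.
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        ChangeAuditStampsClass,
        DashboardInfoClass,
        DashboardSnapshotClass,
        MetadataChangeEventClass,
    )

    # Placeholder actor; replace with a real corpuser URN.
    stamp = AuditStampClass(
        time=int(time.time() * 1000), actor="urn:li:corpuser:datahub"
    )
    info = DashboardInfoClass(
        title=title,
        description=description,
        charts=[],
        lastModified=ChangeAuditStampsClass(created=stamp, lastModified=stamp),
        dashboardUrl=url,
    )
    snapshot = DashboardSnapshotClass(
        urn=dashboard_urn(tool, dashboard_id), aspects=[info]
    )
    mce = MetadataChangeEventClass(proposedSnapshot=snapshot)
    DatahubRestEmitter(gms_server).emit_mce(mce)


if __name__ == "__main__":
    # Placeholder values; needs a reachable DataHub metadata service.
    emit_dashboard(
        "http://localhost:8080",
        "looker",
        "sales_overview",
        "Sales Overview",
        "Weekly sales KPIs",
        "https://looker.example.com/dashboards/1",
    )
```

Because the model classes are typed, running mypy over a script like this should flag a misspelled or missing field before anything is sent, which is the main advantage over hand-writing the JSON.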
g
Thank you so much @gray-shoe-75895 🙏