A complete solution for open data platforms, enterprise data catalogs, data lakes and data management. Open source, mature, fully-featured and production ready.

DataHub

I'm trying to generate a file to be used in File metadata ingestion. For this I used the following yml:
source:
  type: file
  config:
    filename: ./data.json

sink:
  type: file
  config:
    filename: ./output.json

<@U02MMLHSL05>, how have you generated the data.json input file?

<@U02AYK6RSPP> data.json contains raw data from kaggle <https://storage.googleapis.com/kagglesdsdata/datasets/20079/26025/iris.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20211117%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20211117T184236Z&X-Goog-Expires=259199&X-Goog-SignedHeaders=host&X-Goog-Signature=9691d38fb61d386860b69815b0ef212acbdac2d8126004ec69b7c87b2aea08c01452b200d474d2e9cc4bf0e3d8fad84537869fb647bf11ab030dc38dc13e877782ede3e359be1cc67d34840ea9c5dbbc0ca7f98f58a1ee04964b24a0ef154e18f7fc8d8755814e923a9842f7df5d6e08a5322075dfc2ba630c4cafa94f66de0c849197fa8cd2b8bcd3baf4bf881dd715b0d0063ffd6a97e8147bd00fc0121ddd0472f87ee07c61411fcd61a5cf7a21ea5ce3c195cd6bdd49c3ee38b9d8c442b28f3155652db57c339dc990233f3855c429583f504b117930dfba48df4da9678869dfb61671d9c75b2155fa159d027abdd002ccc67750437c4c44e0089ff76390|here>

I was expecting the .yml recipe to convert this raw data into a format that I could then sink to DataHub.

<@U02AYK6RSPP> or should I have the data in a source like a database and then extract metadata automatically from there?

Hi <@U02MMLHSL05>,
The input file should be in a specific format that DataHub can understand. Check out the capabilities section here - <https://datahubproject.io/docs/metadata-ingestion/source_docs/file>

<@U02AYK6RSPP> It is a bit unclear whether the input file is the raw data itself or the metadata?

<@U02MMLHSL05>: the input file is the metadata, not the raw data.

<@UV0M2EB8Q> is there any documentation/template for this input file? The examples <https://github.com/linkedin/datahub/tree/master/metadata-ingestion/examples/mce_files|here> seem to differ a lot

Hi <@U02MMLHSL05> what are the kinds of metadata about the dataset that you are trying to capture?

Something like: dataset name, schema, column-level samples?

<@UV0M2EB8Q> yeah, exactly. Plus data policy, size, number of fields in the data, etc.

If you are comfortable with Python, programmatically generating this metadata should be pretty easy; you shouldn't need to write the file by hand after inspecting the checked-in example files. However, not everything you are asking for is modeled in the metadata yet.
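To make the "raw data vs. metadata" distinction concrete, here is a minimal sketch of deriving the kind of metadata mentioned above (dataset name, field names and types, a few sample values, row and field counts) from raw JSON records using only the standard library. The function names `infer_type` and `describe_records`, the coarse type labels, and the output keys are all hypothetical and chosen for illustration. The resulting dictionary is not the format the DataHub file source expects; it would still have to be mapped onto DataHub's metadata model before ingestion.

```python
import json


def infer_type(values):
    """Map the Python types seen in one column to a coarse type label."""
    kinds = {type(v) for v in values if v is not None}
    if not kinds:
        return "null"
    if kinds <= {bool}:
        return "boolean"
    if kinds <= {int, float}:
        return "number"
    if kinds <= {str}:
        return "string"
    return "mixed"


def describe_records(name, records, sample_size=3):
    """Summarise a list of flat JSON records as dataset-level metadata.

    The output layout is an illustrative assumption, not a DataHub format.
    """
    # Collect field names in first-seen order across all records.
    field_names = []
    for record in records:
        for key in record:
            if key not in field_names:
                field_names.append(key)

    fields = []
    for field in field_names:
        values = [record.get(field) for record in records]
        fields.append({
            "name": field,
            "type": infer_type(values),
            "samples": [v for v in values if v is not None][:sample_size],
        })

    return {
        "dataset_name": name,
        "num_rows": len(records),
        "num_fields": len(fields),
        "fields": fields,
    }


if __name__ == "__main__":
    # Made-up records standing in for the contents of ./data.json;
    # for a real file, replace this list with json.load(open("data.json")).
    raw_records = [
        {"sepalLength": 5.1, "sepalWidth": 3.5, "species": "setosa"},
        {"sepalLength": 7.0, "sepalWidth": 3.2, "species": "versicolor"},
    ]
    print(json.dumps(describe_records("iris", raw_records), indent=2))
```

The raw records describe individual rows, while the summary describes the dataset as a whole; the file source wants the latter, expressed in DataHub's own model.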

You can explore the metadata model for a Dataset here (<https://demo.datahubproject.io/dataset/urn:li:dataset:(urn:li:dataPlatform:datahub,Dataset,PROD)/Schema?is_lineage_mode=false>)

<@UV0M2EB8Q> Thank you. I will check it out.