# ingestion
a
Hi all, I’m new to DataHub and trying to ingest some files as part of a POC. I’m trying to modify the example/demo_data using my own files, but I’m having trouble with it: "generate_demo_data.sh" depends on the file “all_covid19_datasets.json”, which is not available. If I try to run "bigquery_covid19_to_file.yml", I don't have credentials for BigQuery. I assume that if I want to prepare my files using “enrich.py” I should follow the same layout (json) as the “all_covid19_datasets.json” file, but because the file is not available I can’t figure out the layout. To recap: I’m trying to run “enrich.py” using my files as the input file but can’t figure out what the layout of the input file should be. I understand it has to be json, but I also understand there are different json layouts like “split”, “table”, etc. Any help or direction is much appreciated. Thanks!
m
Hi @aloof-gigabyte-305: welcome to DataHub! Is there a specific reason you are trying to follow the enrich.py style ... or would you be fine working with just raw json files?
a
Thank you Shirshanka, no, there is no special reason; I just read some of the comments and thought it was a good point to start. Also, I haven't been able to figure out another way (sorry, I'm new here). I checked the file ingestion page (https://datahubproject.io/docs/metadata-ingestion/sink_docs/file#config-details) but can't figure out how to give the name of the table, or the name within the Dataset folder (like "prod" in the sample data). In other words, how to set up the recipe.
Now I'm trying this simple recipe:
source:
  type: file
  config:
    # Coordinates
    filename: ./my_datasets.json
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
but I'm getting the following error: ValueError: com.linkedin.pegasus2avro.usage.UsageAggregation is missing required field: bucket
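An error like this typically means the file source could not match the JSON records against any of the metadata event shapes it knows, and fell through to trying them as usage records. As a rough sketch (the exact class names and union encoding should be checked against bootstrap_mce.json, which is mentioned below), the file source expects a JSON array of MCE objects looking something like:

```json
[
  {
    "auditHeader": null,
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:file,my_table,PROD)",
        "aspects": []
      }
    },
    "proposedDelta": null
  }
]
```

The platform and table name here (`file`, `my_table`) are placeholders; a raw pandas export does not have this shape, which is why the ingestion fails.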
m
@aloof-gigabyte-305: the bootstrap_mce.json under metadata-ingestion/examples/mce_files is a good file to start with.
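Reusing the recipe shape from the earlier message, pointing the file source at the bootstrap examples might look like this (the path is relative to a DataHub checkout and may need adjusting):

```yaml
source:
  type: file
  config:
    filename: ./metadata-ingestion/examples/mce_files/bootstrap_mce.json
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```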
a
Thank you again Shirshanka, using the bootstrap_mce.json works and everything starts to make sense. Next I think I need to convert my json files (created using python pandas.DataFrame.to_json("data.json", orient="split")) into MCE objects. What is the best way to do that? I'm trying to find documentation about the MCE object, specifically the description of each of its elements and how they should be used, but can't find it yet. Any help on that would also be appreciated.
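A minimal sketch of that conversion, using only the standard library: it reads a pandas `orient="split"` export (`{"columns": ..., "index": ..., "data": ...}`) and builds an MCE-shaped dict whose column list becomes schema fields. The aspect layout, class names, and `nativeDataType` placeholder here are assumptions modeled on bootstrap_mce.json and should be verified against the real examples before ingesting.

```python
import json

def split_json_to_mce(split_json: str, platform: str, name: str) -> dict:
    """Build a DatasetSnapshot-style MCE dict from pandas to_json(orient="split") output.

    The exact union encoding ("com.linkedin.pegasus2avro..." keys) is an
    assumption to double-check against bootstrap_mce.json.
    """
    data = json.loads(split_json)  # {"columns": [...], "index": [...], "data": [...]}
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)"
    # One schema field per pandas column; dtype mapping is left as "unknown" here.
    fields = [
        {"fieldPath": col, "nativeDataType": "unknown"}
        for col in data["columns"]
    ]
    return {
        "auditHeader": None,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": urn,
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": name,
                            "fields": fields,
                        }
                    }
                ],
            }
        },
        "proposedDelta": None,
    }

# Example: a tiny two-column frame exported with orient="split".
mce = split_json_to_mce(
    '{"columns": ["id", "value"], "index": [0], "data": [[1, 2.5]]}',
    platform="file",
    name="my_table",
)
print(json.dumps([mce], indent=2))  # a JSON array, as the file source expects
```

Writing a list of such dicts to a file and feeding it to the file-source recipe above is one plausible path; generating the objects via schema_classes.py (mentioned below) avoids hand-rolling this JSON.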
m
@aloof-gigabyte-305 all the objects are available as classes in the schema_classes.py file in the metadata-ingestion module. You can find examples of using these classes throughout the code-base in metadata-ingestion/src/datahub/ingestion/sources.
a
Thanks again Shirshanka. Just to let you know that with the samples you mentioned I was able to ingest many of my tables into DataHub. I'll continue ingesting data assets (and I'm sure I will have more questions). THANKS !