# ingestion
a
Hi all, I’m new to DataHub and trying to ingest some files as part of a POC. I’m trying to modify the example/demo_data using my own files, but I’m having trouble with it: "generate_demo_data.sh" depends on the file “all_covid19_datasets.json”, which is not available. If I try to run "bigquery_covid19_to_file.yml", I don't have credentials for BigQuery. I assume that if I want to prepare my files using “enrich.py” I should follow the same layout (json) as the “all_covid19_datasets.json” file, but because the file is not available I can’t figure out the layout. To recap: I’m trying to run “enrich.py” using my files as the input file but can’t figure out what the layout of the input file should be. I understand it has to be json, but I also understand there are different json layouts like “split”, “table”, etc. Any help or direction is much appreciated. Thanks!
m
Hi @aloof-gigabyte-305: welcome to DataHub! Is there a specific reason you are trying to follow the enrich.py style ... or would you be fine working with just raw json files?
a
Thank you Shirshanka, no, there is no special reason; I just read some of the comments and thought it was a good point to start. Also, I haven't been able to figure out another way (sorry, I'm new here). I checked the file ingestion page (https://datahubproject.io/docs/metadata-ingestion/sink_docs/file#config-details) but can't figure out how to give the name of the table, or the name within the Dataset folder (like "prod" in the sample data). In other words, how to set up the recipe.
Now I'm trying this simple recipe:
source:
  type: file
  config:
    # Coordinates
    filename: ./my_datasets.json
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
but I'm getting the following error: ValueError: com.linkedin.pegasus2avro.usage.UsageAggregation is missing required field: bucket
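An error like this typically means the file source could not match the JSON records against any of the metadata event shapes it knows, and fell through to trying them as usage records. As a rough sketch (the exact class names and union encoding should be checked against bootstrap_mce.json, which is mentioned below), the file source expects a JSON array of MCE objects looking something like:

```json
[
  {
    "auditHeader": null,
    "proposedSnapshot": {
      "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
        "urn": "urn:li:dataset:(urn:li:dataPlatform:file,my_table,PROD)",
        "aspects": []
      }
    },
    "proposedDelta": null
  }
]
```

The platform and table name here (`file`, `my_table`) are placeholders; a raw pandas export does not have this shape, which is why the ingestion fails.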
m
@aloof-gigabyte-305: the bootstrap_mce.json under metadata-ingestion/examples/mce_files is a good file to start with.
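Reusing the recipe shape from the earlier message, pointing the file source at the bootstrap examples might look like this (the path is relative to a DataHub checkout and may need adjusting):

```yaml
source:
  type: file
  config:
    filename: ./metadata-ingestion/examples/mce_files/bootstrap_mce.json
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```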
a
Thank you again Shirshanka, using the bootstrap_mce.json works and everything starts to make sense. Next I think I need to convert my json files (created using python pandas.DataFrame.to_json("data.json", orient="split")) into MCE objects. What is the best way to do that? I'm trying to find documentation about the MCE object, specifically the description of each of its elements and how they should be used, but can't find it yet. Any help on that would also be appreciated.
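A minimal sketch of that conversion, using only the standard library: it reads a pandas `orient="split"` export (`{"columns": ..., "index": ..., "data": ...}`) and builds an MCE-shaped dict whose column list becomes schema fields. The aspect layout, class names, and `nativeDataType` placeholder here are assumptions modeled on bootstrap_mce.json and should be verified against the real examples before ingesting.

```python
import json

def split_json_to_mce(split_json: str, platform: str, name: str) -> dict:
    """Build a DatasetSnapshot-style MCE dict from pandas to_json(orient="split") output.

    The exact union encoding ("com.linkedin.pegasus2avro..." keys) is an
    assumption to double-check against bootstrap_mce.json.
    """
    data = json.loads(split_json)  # {"columns": [...], "index": [...], "data": [...]}
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},PROD)"
    # One schema field per pandas column; dtype mapping is left as "unknown" here.
    fields = [
        {"fieldPath": col, "nativeDataType": "unknown"}
        for col in data["columns"]
    ]
    return {
        "auditHeader": None,
        "proposedSnapshot": {
            "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
                "urn": urn,
                "aspects": [
                    {
                        "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                            "schemaName": name,
                            "fields": fields,
                        }
                    }
                ],
            }
        },
        "proposedDelta": None,
    }

# Example: a tiny two-column frame exported with orient="split".
mce = split_json_to_mce(
    '{"columns": ["id", "value"], "index": [0], "data": [[1, 2.5]]}',
    platform="file",
    name="my_table",
)
print(json.dumps([mce], indent=2))  # a JSON array, as the file source expects
```

Writing a list of such dicts to a file and feeding it to the file-source recipe above is one plausible path; generating the objects via schema_classes.py (mentioned below) avoids hand-rolling this JSON.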
m
@aloof-gigabyte-305 all the objects are available as classes in the schema_classes.py file in the metadata-ingestion module. You can find examples of using these classes throughout the code-base in metadata-ingestion/src/datahub/ingestion/sources.
a
Thanks again Shirshanka. Just to let you know that with the samples you mentioned I was able to ingest many of my tables into DataHub. I'll continue ingesting data assets (and I'm sure I will have more questions). THANKS !