# ingestion
w
I'm new to DataHub, and it looks very promising for metadata management. I want to take the "file to datahub (REST)" path with a yml config, but I don't know what the different fields mean. Is there any doc with an example to get me up to speed? I'd really appreciate the help.
b
There are several components in the yml config: 1. Source of metadata 2. Transformers applied (can be none) 3. Sink (where the metadata is written to). The file to datahub rest recipe reads a specified json file containing the metadata and writes it to the REST endpoint. For the config options of the source and sink, refer to https://github.com/linkedin/datahub/tree/master/metadata-ingestion where there are source docs and sink docs folders.
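To make the source / sink layout concrete, here is a minimal sketch that builds such a recipe and prints it as yml. The field names `filename` and `server`, the sink type `datahub-rest`, and the address `http://localhost:8080` are assumptions based on the file source and REST sink docs from around this time; they may differ in other DataHub versions, so check the source docs and sink docs folders before relying on them. The helper names `build_recipe` and `to_yaml` are made up for this example. The optional transformers section is left out.

```python
def build_recipe(mce_file: str, server: str) -> dict:
    """Return a recipe with a file source and a REST sink (no transformers)."""
    return {
        # Source: where the metadata is read from (assumed field: filename).
        "source": {"type": "file", "config": {"filename": mce_file}},
        # Sink: where the metadata is written to (assumed field: server).
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def to_yaml(node: dict, indent: int = 0) -> str:
    """Render a nested dict of strings as simple yml, two spaces per level."""
    lines = []
    pad = " " * indent
    for key, value in node.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 2))
        else:
            lines.append(f'{pad}{key}: "{value}"')
    return "\n".join(lines)


recipe = build_recipe(
    "./examples/mce_files/bootstrap_mce.json",  # json file with the metadata
    "http://localhost:8080",  # assumed address of the DataHub REST endpoint
)
recipe_yaml = to_yaml(recipe)
print(recipe_yaml)
```

Saving the printed text to a `.yml` file gives a recipe that can be passed to `datahub ingest -c`.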
w
Which yml is used, and which json data file is consumed, when the sample data is ingested with this command?
datahub docker ingest-sample-data
b
The code here is responsible for the command: https://github.com/linkedin/datahub/blob/214215759011fc983d4bfda16dab5630a02bfa14/metadata-ingestion/src/datahub/cli/docker.py. It ingests
metadata-ingestion/examples/mce_files/bootstrap_mce.json
the yml file is defined on the fly in this case
w
@witty-keyboard-20400 The sample data are the bootstrap mces here
w
@better-orange-49102 and @witty-state-99511 Thanks for replying to my question. I had found that input file
{datahubrepo}/metadata-ingestion/examples/mce_files/bootstrap_mce.json
by grepping for the keywords seen on the UI. But the {datahubrepo} folder is my checked-out repo from GitHub. I changed entries in that file, but the change was not reflected in the UI after the ingestion. So, my understanding is that the docker command:
datahub docker ingest-sample-data
...doesn't care about what I've done in a checked-out repo. The command refers to a bootstrap_mce.json somewhere else. Could any of you help me understand where the actual yml and json are located that are used by the docker command
datahub docker ingest-sample-data
?
b
The ingest-sample-data command does not use the local repo; it downloads the file from the GitHub repo online.
You should not use the ingest-sample-data command; instead, from the metadata-ingestion folder, use
datahub ingest -c ./recipes/file_to_datahub_rest.yml
Then you can modify your MCE.json.
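As a rough sketch of that local workflow: after editing the json file, it can help to confirm that it still parses before running the ingest command from the metadata-ingestion folder. This assumes the MCE file is a JSON array of entries and that the recipe points at that file; the helper names `count_mce_entries`, `ingest_command`, and `ingest_local_file` are made up for this example.

```python
import json
import subprocess
from pathlib import Path


def count_mce_entries(mce_path: str) -> int:
    """Parse the MCE file and return the number of top-level entries."""
    data = json.loads(Path(mce_path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        # Assumption: the file source expects a JSON array of MCEs.
        raise ValueError(f"{mce_path} should contain a JSON array")
    return len(data)


def ingest_command(recipe_path: str) -> list:
    """Build the same CLI call as in the message above."""
    return ["datahub", "ingest", "-c", recipe_path]


def ingest_local_file(mce_path: str, recipe_path: str) -> int:
    """Check the edited file, then run the ingestion with the given recipe."""
    count = count_mce_entries(mce_path)  # fails early on broken JSON
    subprocess.run(ingest_command(recipe_path), check=True)
    return count
```

Calling `ingest_local_file` with the path of the edited json file and `"./recipes/file_to_datahub_rest.yml"` would run the ingestion against the local file, provided the `datahub` CLI is installed and the recipe's source points at that same file.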