Hi I was playing around with `datahub ingest list runs` and DataHub #ingestion

high-hospital-85984

09/21/2021, 10:00 AM

Hi! I was playing around with `datahub ingest list-runs` and got presented with something unexpected. Most IDs are random GUIDs, and one run that is suspiciously large in terms of row count is called `no-run-id-provided`. We primarily use the Kafka sink. Is there a way of providing a human-readable name for the runs, for easier rollback?

mammoth-bear-12532

09/21/2021, 2:12 PM

The `no-run-id-provided` rows were ingested before the ingestion framework had the ability to add run ids on ingestion. I have been thinking about this readability issue as well. Today, run ids can be specified in the ingestion YAML as part of the config. E.g. `run_id: looker` will attach a static run_id to each run.
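A minimal sketch of where that key could sit in a recipe, assuming `run_id` is a top-level key next to `source` and `sink`. The file name, URLs, and credentials below are placeholders, not values from this thread:

```shell
#!/bin/sh
# Hypothetical recipe path, used only for illustration.
RECIPE=./looker_recipe.yml

# The quoted 'EOF' delimiter stops the shell from expanding anything,
# so the file is written exactly as shown.
cat > "$RECIPE" <<'EOF'
# Static, human-readable id attached to every run of this recipe.
run_id: looker

source:
  type: looker
  config:
    base_url: https://looker.example.com   # placeholder
    client_id: my-client-id                # placeholder
    client_secret: my-client-secret        # placeholder

sink:
  type: datahub-kafka
  config:
    connection:
      bootstrap: localhost:9092                    # placeholder
      schema_registry_url: http://localhost:8081   # placeholder
EOF

echo "wrote $RECIPE"
```

Because the id is static, every run of this recipe shares the same `run_id`.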

mammoth-bear-12532

09/21/2021, 2:13 PM

`datahub ingest show --run-id RUN_ID` will give you a summary of that run, with a sample of the rows ingested.

high-hospital-85984

09/21/2021, 2:57 PM

Oh nice! I expected there to be an option like that but failed to find it! 😀 Going to add it to all our recipes now.

mammoth-bear-12532

09/21/2021, 3:04 PM

I was going to experiment with dynamic run ids using env var expansion. Let me know if you come up with something nifty.

high-hospital-85984

09/21/2021, 4:01 PM

First thought was to do `export RUN_ID_SUFFIX=$(date +%s)` before the run and have `run_id: looker_${RUN_ID_SUFFIX}` in the config. Works for us, as we run some preparation scripts before the actual ingestion anyway.
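As a rough sketch, a preparation script along these lines could set the suffix and preview the resulting id. It assumes the recipe contains `run_id: looker_${RUN_ID_SUFFIX}` and that the CLI expands environment variables in the recipe. The recipe path is hypothetical, and the UTC timestamp format is just a more readable alternative to `date +%s`:

```shell
#!/bin/sh
# Sortable UTC timestamp, e.g. 20210921T160100Z, used as the run id suffix.
RUN_ID_SUFFIX=$(date -u +%Y%m%dT%H%M%SZ)
export RUN_ID_SUFFIX

# Preview the id the recipe should resolve to, so it can be logged
# and reused later for `datahub ingest show` or a rollback.
RUN_ID="looker_${RUN_ID_SUFFIX}"
echo "run id for this ingestion: ${RUN_ID}"

# Hypothetical recipe path; uncomment where the datahub CLI is installed.
# datahub ingest -c ./looker_recipe.yml
```

Logging the id before the run makes it easy to find later without going through `datahub ingest list-runs`.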

mammoth-bear-12532

09/21/2021, 4:24 PM

Hmm, there might be a way to do it in code also ... hang on

loud-island-88694

09/21/2021, 5:51 PM

We have a PR open for this: https://github.com/linkedin/datahub/pull/3279/

high-hospital-85984

09/21/2021, 7:31 PM

That was quick! 😅

mammoth-bear-12532

09/21/2021, 8:02 PM

That's @loud-island-88694 executing 🏃
