# ingestion
s
Hi @lemon-hydrogen-83671, very impressed with your file-based lineage - very handy stuff for an initial data bootstrap. One question from my side regarding this source: does the data file support yaml-anchors? like this:
Copy code
version: 1
lineage:
  - entity: &dataset
      name: report.payment_reconciliation
      type: dataset
      platform: postgres
      platform_instance: mvp
    upstream:
      - entity: &datajob
          name: report.load_payment_reconciliation
          type: datajob
          platform: postgres
          platform_instance: mvp
  - entity:
      <<: *datajob
      name: report.load_payment_reconciliation
    upstream:
      - entity:
          <<: *dataset
          name: core.payment
      - entity:
          <<: *dataset
          name: core.ph2_transaction
      - entity:
          <<: *dataset
          name: core.ph2_order
afaik the answer is no - do you have any plans to support something like this?
l
oh cool, to be honest i've never heard of yaml-anchors before this, i'd have to do some more research on it
s
in our product we’ve created our own plugin to parse the following structure (source/target definitions of plpgsql procedures):
Copy code
source:
  type: plpgsql
  config:
    database: mvp
    dbms: postgres
    pipelines:
      - group: payments
        entities:
          - name: report.load_payment_reconciliation
            sources:
              - core.payment
              - core.ph2_transaction
              - core.ph2_order
            target: report.payment_reconciliation
l
so yeah, long story short i'd guess no 😛
s
this notation is pretty narrow (only suitable for procedure-to-table lineage), but it's much more compact
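A minimal sketch of how such a converter might look - this is not the actual plugin, and the function name and file handling are illustrative. Since the lineage file source only accepts dataset entities (as it turns out later in this thread), the procedure step is collapsed into direct table-to-table edges:
Copy code
import yaml

# Hypothetical converter (illustrative only): expand the compact
# plpgsql-style notation into the file-based lineage format.
# The procedure step is folded into direct table-to-table edges,
# since the lineage file source accepts only 'dataset' entities.
def plpgsql_to_lineage(doc: dict) -> dict:
    cfg = doc["source"]["config"]

    def dataset(name: str) -> dict:
        return {
            "name": name,
            "type": "dataset",
            "platform": cfg["dbms"],
            "platform_instance": cfg["database"],
        }

    lineage = []
    for pipeline in cfg["pipelines"]:
        for proc in pipeline["entities"]:
            lineage.append({
                "entity": dataset(proc["target"]),
                "upstream": [{"entity": dataset(src)} for src in proc["sources"]],
            })
    return {"version": 1, "lineage": lineage}

with open("plpgsql_data.yml") as f:
    print(yaml.safe_dump(plpgsql_to_lineage(yaml.safe_load(f)), sort_keys=False))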
l
yeah i can see that, makes it much cleaner!
I don't have any plans to work on it in the short term but it's open sourced so feel free to add on!
s
yq can do this )
Copy code
datahub % yq 'explode(.)' plpgsql_data.yml
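explode(.) resolves anchors, aliases, and << merge keys in place, so an anchored data file can be pre-processed into plain YAML before handing it to the ingestion.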
is the yaml processed by PyYAML?
l
yup. It makes use of the config_loader that the CLI uses for recipes: https://github.com/linkedin/datahub/blob/d474387eeb2e092bdfc50654363bc5f6edc2b7a2/metadata-ingestion/src/datahub/configuration/yaml.py which is PyYAML i think.
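For what it's worth, a quick standalone check (plain PyYAML, not DataHub code) shows that anchors, aliases, and << merge keys are resolved at load time, so by the time the source sees the parsed objects the anchors are already expanded:
Copy code
import yaml

# PyYAML (YAML 1.1) resolves anchors (&), aliases (*) and the <<
# merge key while loading, so the resulting Python objects carry
# no anchor metadata at all.
doc = yaml.safe_load("""
lineage:
  - entity: &dataset
      name: report.payment_reconciliation
      type: dataset
      platform: postgres
  - entity:
      <<: *dataset
      name: core.payment
""")

# the explicit 'name' overrides the merged-in one
print(doc["lineage"][1]["entity"])
# -> {'name': 'core.payment', 'type': 'dataset', 'platform': 'postgres'}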
s
seems like it works - there was a typo in the data file
🙌 1
what types are currently supported? are there any restrictions?
Copy code
ConfigurationError: Type must be one of ['dataset'], datajob is not yet supported.
dataset-to-dataset only
Copy code
---
version: 1
lineage:
  - entity: &dataset
      name: report.payment_reconciliation
      type: dataset
      platform: postgres
      platform_instance: mvp
    upstream:
      - entity:
          <<: *dataset
          name: core.payment
      - entity:
          <<: *dataset
          name: core.ph2_transaction
      - entity:
          <<: *dataset
          name: core.ph2_order
Copy code
dmytro.kulyk@MB-DAT-564087 datahub % datahub ingest -c plpgsql.yml
[2022-03-10 21:16:00,846] INFO     {datahub.cli.ingest_cli:70} - DataHub CLI version: 0.8.28.0
[2022-03-10 21:16:01,241] INFO     {datahub.cli.ingest_cli:86} - Starting metadata ingestion
[2022-03-10 21:16:01,246] INFO     {datahub.ingestion.source.metadata.lineage:175} - preserve_upstream is set to True
[2022-03-10 21:16:01,246] INFO     {datahub.ingestion.source.metadata.lineage:122} - Upstream detected for env='PROD' name='report.payment_reconciliation' type='dataset' platform='postgres' platform_instance='mvp'. Extracting urn...
[2022-03-10 21:16:01,911] INFO     {datahub.ingestion.run.pipeline:85} - sink wrote workunit lineage-urn:li:dataset:(urn:li:dataPlatform:postgres,mvp.report.payment_reconciliation,PROD)
[2022-03-10 21:16:01,911] INFO     {datahub.cli.ingest_cli:88} - Finished metadata ingestion

Source (datahub-lineage-file) report:
{'workunits_produced': 1,
 'workunit_ids': ['lineage-urn:li:dataset:(urn:li:dataPlatform:postgres,mvp.report.payment_reconciliation,PROD)'],
 'warnings': {},
 'failures': {},
 'cli_version': '0.8.28.0'}
Sink (datahub-rest) report:
{'records_written': 1,
 'warnings': [],
 'failures': [],
 'downstream_start_time': datetime.datetime(2022, 3, 10, 21, 16, 1, 475220),
 'downstream_end_time': datetime.datetime(2022, 3, 10, 21, 16, 1, 911311),
 'downstream_total_latency_in_seconds': 0.436091,
 'gms_version': 'v0.8.27'}
need to add this somewhere in the documentation
@abundant-receptionist-6114 fyi
s
i mean the thing about anchoring
👍 1