# ingestion
s
Hi @lemon-hydrogen-83671, very impressed with your file-based lineage - very handy stuff for an initial data bootstrap. One question from my side regarding this source: does the data file support yaml-anchors? like this:
Copy code
version: 1
lineage:
  - entity: &dataset
      name: report.payment_reconciliation
      type: dataset
      platform: postgres
      platform_instance: mvp
    upstream:
      - entity: &datajob
          name: report.load_payment_reconciliation
          type: datajob
          platform: postgres
          platform_instance: mvp
  - entity:
      <<: *datajob
      name: report.load_payment_reconciliation
    upstream:
      - entity:
          <<: *dataset
          name: core.payment
      - entity:
          <<: *dataset
          name: core.ph2_transaction
      - entity:
          <<: *dataset
          name: core.ph2_order
afaik the answer is no - do you have any plans to support something like this?
l
oh cool, to be honest i've never heard of yaml-anchors before this, i'd have to do some more research on it
s
in our product we’ve created our own plugin to parse the following structure (source/target definitions of plpgsql procedures):
Copy code
source:
  type: plpgsql
  config:
    database: mvp
    dbms: postgres
    pipelines:
      - group: payments
        entities:
          - name: report.load_payment_reconciliation
            sources:
              - core.payment
              - core.ph2_transaction
              - core.ph2_order
            target: report.payment_reconciliation
l
so yeah, long story short i'd guess no 😛
s
this notation is pretty narrow (only suitable for procedure-to-table lineage), but it's much more compact
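A minimal sketch of how such a converter might look - this is not the actual plugin, and the function name and file handling are illustrative. Since the lineage file source only accepts dataset entities (as it turns out later in this thread), the procedure step is collapsed into direct table-to-table edges:
Copy code
import yaml

# Hypothetical converter (illustrative only): expand the compact
# plpgsql-style notation into the file-based lineage format.
# The procedure step is folded into direct table-to-table edges,
# since the lineage file source accepts only 'dataset' entities.
def plpgsql_to_lineage(doc: dict) -> dict:
    cfg = doc["source"]["config"]

    def dataset(name: str) -> dict:
        return {
            "name": name,
            "type": "dataset",
            "platform": cfg["dbms"],
            "platform_instance": cfg["database"],
        }

    lineage = []
    for pipeline in cfg["pipelines"]:
        for proc in pipeline["entities"]:
            lineage.append({
                "entity": dataset(proc["target"]),
                "upstream": [{"entity": dataset(src)} for src in proc["sources"]],
            })
    return {"version": 1, "lineage": lineage}

with open("plpgsql_data.yml") as f:
    print(yaml.safe_dump(plpgsql_to_lineage(yaml.safe_load(f)), sort_keys=False))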
l
yeah i can see that, makes it much cleaner!
I don't have any plans to work on it in the short term but it's open sourced so feel free to add on!
s
yq can do this )
Copy code
datahub % yq 'explode(.)' plpgsql_data.yml
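explode(.) resolves anchors, aliases, and << merge keys in place, so an anchored data file can be pre-processed into plain YAML before handing it to the ingestion.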
is the yaml processed by PyYAML?
l
yup. It makes use of the config_loader that the CLI uses for recipes: https://github.com/linkedin/datahub/blob/d474387eeb2e092bdfc50654363bc5f6edc2b7a2/metadata-ingestion/src/datahub/configuration/yaml.py which is PyYAML i think.
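For what it's worth, a quick standalone check (plain PyYAML, not DataHub code) shows that anchors, aliases, and << merge keys are resolved at load time, so by the time the source sees the parsed objects the anchors are already expanded:
Copy code
import yaml

# PyYAML (YAML 1.1) resolves anchors (&), aliases (*) and the <<
# merge key while loading, so the resulting Python objects carry
# no anchor metadata at all.
doc = yaml.safe_load("""
lineage:
  - entity: &dataset
      name: report.payment_reconciliation
      type: dataset
      platform: postgres
  - entity:
      <<: *dataset
      name: core.payment
""")

# the explicit 'name' overrides the merged-in one
print(doc["lineage"][1]["entity"])
# -> {'name': 'core.payment', 'type': 'dataset', 'platform': 'postgres'}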
s
seems like it works - there was a typo in the data file
🙌 1
what types are currently supported? are there any restrictions?
Copy code
ConfigurationError: Type must be one of ['dataset'], datajob is not yet supported.
dataset-to-dataset only
Copy code
---
version: 1
lineage:
  - entity: &dataset
      name: report.payment_reconciliation
      type: dataset
      platform: postgres
      platform_instance: mvp
    upstream:
      - entity:
          <<: *dataset
          name: core.payment
      - entity:
          <<: *dataset
          name: core.ph2_transaction
      - entity:
          <<: *dataset
          name: core.ph2_order
Copy code
dmytro.kulyk@MB-DAT-564087 datahub % datahub ingest -c plpgsql.yml
[2022-03-10 21:16:00,846] INFO     {datahub.cli.ingest_cli:70} - DataHub CLI version: 0.8.28.0
[2022-03-10 21:16:01,241] INFO     {datahub.cli.ingest_cli:86} - Starting metadata ingestion
[2022-03-10 21:16:01,246] INFO     {datahub.ingestion.source.metadata.lineage:175} - preserve_upstream is set to True
[2022-03-10 21:16:01,246] INFO     {datahub.ingestion.source.metadata.lineage:122} - Upstream detected for env='PROD' name='report.payment_reconciliation' type='dataset' platform='postgres' platform_instance='mvp'. Extracting urn...
[2022-03-10 21:16:01,911] INFO     {datahub.ingestion.run.pipeline:85} - sink wrote workunit lineage-urn:li:dataset:(urn:li:dataPlatform:postgres,mvp.report.payment_reconciliation,PROD)
[2022-03-10 21:16:01,911] INFO     {datahub.cli.ingest_cli:88} - Finished metadata ingestion

Source (datahub-lineage-file) report:
{'workunits_produced': 1,
 'workunit_ids': ['lineage-urn:li:dataset:(urn:li:dataPlatform:postgres,mvp.report.payment_reconciliation,PROD)'],
 'warnings': {},
 'failures': {},
 'cli_version': '0.8.28.0'}
Sink (datahub-rest) report:
{'records_written': 1,
 'warnings': [],
 'failures': [],
 'downstream_start_time': datetime.datetime(2022, 3, 10, 21, 16, 1, 475220),
 'downstream_end_time': datetime.datetime(2022, 3, 10, 21, 16, 1, 911311),
 'downstream_total_latency_in_seconds': 0.436091,
 'gms_version': 'v0.8.27'}
need to add this somewhere in the documentation
@abundant-receptionist-6114 fyi
s
i mean the thing about anchoring
👍 1