# troubleshoot
s
Hello, I'm having trouble with datahub-lineage-file. I'm trying to add the sample data from https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/file_lineage.yml and I'm getting this error (in thread):
[2022-09-01 09:16:21,514] ERROR    {datahub.ingestion.run.pipeline:54} -  failed to write record with workunit lineage-urn:li:dataset:(urn:li:dataPlatform:kafka,topic3,DEV) with Expecting value: line 1 column 1 (char 0) and info {}
[2022-09-01 09:16:21,522] ERROR    {datahub.ingestion.run.pipeline:54} -  failed to write record with workunit lineage-urn:li:dataset:(urn:li:dataPlatform:kafka,topic2,DEV) with Expecting value: line 1 column 1 (char 0) and info {}
[2022-09-01 09:16:21,528] INFO     {datahub.cli.ingest_cli:143} - Finished metadata ingestion

Cli report:
{'cli_entry_location': '/usr/local/lib/python3.7/site-packages/datahub/__init__.py',
 'cli_version': '0.8.43.6',
 'os_details': 'Linux-5.10.130-118.517.amzn2.x86_64-x86_64-with-glibc2.2.5',
 'py_exec_path': '/usr/bin/python3',
 'py_version': '3.7.10 (default, Jun  3 2021, 00:02:01) \n[GCC 7.3.1 20180712 (Red Hat 7.3.1-13)]'}
Source (datahub-lineage-file) report:
{'event_ids': ['lineage-urn:li:dataset:(urn:li:dataPlatform:kafka,topic3,DEV)', 'lineage-urn:li:dataset:(urn:li:dataPlatform:kafka,topic2,DEV)'],
 'events_produced': '2',
 'events_produced_per_sec': '0',
 'failures': {},
 'read_rate': '0',
 'running_time_in_seconds': '0',
 'start_time': '2022-09-01 09:16:21.486920',
 'warnings': {}}
Sink (datahub-rest) report:
{'current_time': '2022-09-01 09:16:21.695982',
 'failures': [{'e': 'Expecting value: line 1 column 1 (char 0)'}, {'e': 'Expecting value: line 1 column 1 (char 0)'}],
 'gms_version': 'v0.8.43',
 'pending_requests': '0',
 'records_written_per_second': '0',
 'start_time': '2022-09-01 09:16:20.645484',
 'total_duration_in_seconds': '1.05',
 'total_records_written': '0',
 'warnings': []}
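
For context, `Expecting value: line 1 column 1 (char 0)` is the message Python's `json` module produces when it is asked to parse an empty or non-JSON string. The log does not show what the sink actually received, so the following is only a minimal sketch under the assumption that the REST sink tried to decode a response body that was not JSON. The helper name `describe_json_error` is hypothetical and is not part of DataHub.

```python
import json


def describe_json_error(body: str) -> str:
    """Return the JSON decoder's error message for body, or 'ok' if it parses."""
    try:
        json.loads(body)
    except json.JSONDecodeError as exc:
        return str(exc)
    return "ok"


# An empty body and an HTML page both fail at the very first character.
print(describe_json_error(""))
print(describe_json_error("<html>login</html>"))
# A real JSON body parses without error.
print(describe_json_error('{"value": 1}'))
```

If this assumption holds, the two failures in the sink report would point at what the server returned rather than at the content of the lineage file.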
b
you need to specify the config, not pass the lineage file as the config
source:
  type: datahub-lineage-file
  config:
    # Coordinates
    file: /path/to/file_lineage.yml
    # Whether we want to query datahub-gms for upstream data
    preserve_upstream: False

sink:
# sink configs
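
To make the distinction concrete: the recipe passed to `datahub ingest -c` has a top-level `source` key, while the lineage file has top-level `version` and `lineage` keys. A rough sketch of telling the two apart, assuming the YAML has already been loaded into a dict (for example with PyYAML); `classify_datahub_yaml` is a hypothetical helper, not a DataHub function.

```python
def classify_datahub_yaml(doc: dict) -> str:
    """Guess whether a parsed YAML document is a recipe or a lineage file."""
    if "source" in doc:
        # Recipes describe where metadata comes from (and usually a sink).
        return "recipe"
    if "version" in doc and "lineage" in doc:
        # Lineage files list entities and their upstreams.
        return "lineage-file"
    return "unknown"


recipe = {
    "source": {
        "type": "datahub-lineage-file",
        "config": {"file": "/path/to/file_lineage.yml", "preserve_upstream": False},
    },
}
lineage = {"version": 1, "lineage": [{"entity": {"name": "topic3"}}]}

print(classify_datahub_yaml(recipe))
print(classify_datahub_yaml(lineage))
```

Only a file that classifies as a recipe should be given to `-c`; the lineage file is referenced from the recipe's `file` setting.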
s
hmm.. I thought I did it correctly… first I created a file named file_linage.yml:
---
version: 1
lineage:
  - entity:
      name: topic3
      type: dataset
      env: DEV
      platform: kafka
    upstream:
      - entity:
          name: topic2
          type: dataset
          env: DEV
          platform: kafka
      - entity:
          name: topic1
          type: dataset
          env: DEV
          platform: kafka
  - entity:
      name: topic2
      type: dataset
      env: DEV
      platform: kafka
    upstream:
      - entity:
          name: kafka.topic2
          env: PROD
          platform: snowflake
          platform_instance: test
          type: dataset
then I created a file example_linage.yml (the config?):
source:
  type: datahub-lineage-file
  config:
    # Coordinates
    file: /opt/datahub/gms_data/file_linage.yml
    # Whether we want to query datahub-gms for upstream data
    preserve_upstream: False
sink:
    type: datahub-rest
    config:
        server: 'http://localhost:8080'
and then I launch datahub ingest -c example_linage.yml (using the config file above, as I thought). What am I doing wrong?
b
hmm ok, my bad, I didn't notice the config file you posted