purple-terabyte-64712
05/10/2023, 7:31 AM

colossal-tent-57599
05/10/2023, 8:07 AM

bland-orange-13353
05/10/2023, 12:56 PM

loud-librarian-93625
05/10/2023, 1:29 PM
I'm running
datahub ingest -c 'C:\Users\matt.evans\.datahub\tableau\tableau.dhub.yaml' --dry-run
but am getting the following error
File "C:\Users\matt.evans\AppData\Local\Programs\Python\Python310\lib\site-packages\datahub\configuration\config_loader.py", line 101, in load_config_file
raise ConfigurationError(
datahub.configuration.common.ConfigurationError: Cannot read remote file C:\Users\matt.evans\.datahub\tableau\tableau.dhub.yaml, error:No connection adapters were found for 'C:\\Users\\matt.evans\\.datahub\\tableau\\tableau.dhub.yaml'
Any idea what I'm doing wrong? Seems to be something in the yaml file it doesn't like.
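(A minimal sketch of why this can happen, assuming the config loader decides a file is "remote" by checking for a URL scheme: urlparse treats the drive letter of a Windows path as a scheme, so the path gets handed to requests, which rejects the unknown scheme with exactly this message. If that reading is right, running the command from the file's directory with a relative path, e.g. datahub ingest -c tableau.dhub.yaml --dry-run, may avoid it, since no drive-letter colon appears in the argument.)

from urllib.parse import urlparse

import requests

path = r"C:\Users\matt.evans\.datahub\tableau\tableau.dhub.yaml"

# The drive letter parses as a URL scheme, so a naive
# "has a scheme -> remote file" check classifies this local path as remote.
print(urlparse(path).scheme)  # -> 'c'

# requests then refuses the unknown scheme with the error from the traceback.
try:
    requests.get(path)
except requests.exceptions.InvalidSchema as e:
    print(e)  # No connection adapters were found for 'C:\\Users\\...'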
bland-orange-13353
05/10/2023, 3:31 PM

bland-orange-13353
05/10/2023, 3:39 PM

rapid-spoon-75609
05/10/2023, 9:25 PM
team tag in metadata as a tag on the Kafka topic

powerful-answer-39247
05/11/2023, 2:33 AM
File "/tmp/datahub/ingest/venv-postgres-0.10.2/lib/python3.10/site-packages/datahub/ingestion/source/state_provider/datahub_ingestion_checkpointing_provider.py", line 76, in get_latest_checkpoint
] = self.graph.get_latest_timeseries_value(
File "/tmp/datahub/ingest/venv-postgres-0.10.2/lib/python3.10/site-packages/datahub/ingestion/graph/client.py", line 299, in get_latest_timeseries_value
assert len(values) == 1
AssertionError
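(One way to sidestep this while debugging, assuming the failure is in the stateful-ingestion checkpoint lookup shown in this trace, is to disable stateful ingestion in the recipe so the checkpointing provider is never consulted. A minimal dict-form sketch; the connection values are placeholders:)

# Sketch: recipe with stateful ingestion disabled, so the checkpointing
# provider from the traceback above is never queried.
recipe = {
    "source": {
        "type": "postgres",
        "config": {
            "host_port": "localhost:5432",  # illustrative placeholder
            "stateful_ingestion": {"enabled": False},
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
}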
important-area-90857
05/11/2023, 5:30 AM

loud-hospital-37195
05/11/2023, 7:15 AM

numerous-refrigerator-15664
05/11/2023, 7:19 AM

mysterious-table-75773
05/11/2023, 9:02 AM

delightful-painter-8227
05/11/2023, 10:04 AM

loud-hospital-37195
05/11/2023, 11:37 AM

lemon-scooter-69730
05/11/2023, 12:48 PM
pipeline = Pipeline.create(recipe)
pipeline.run()
pipeline.pretty_print_summary()
For example it throws this exception:
if regex("LATERAL VIEW EXPLODE(col)"):
TypeError: 'str' object is not callable
This error comes from sqllineage, because it picks up the latest version of sqlparse==0.4.4; pinning my version to 0.4.3 fixed the problem. I also noticed that sqllineage==1.3.6 uses the monkey patch present here; I resolved it by moving my version of sqllineage to 1.4.2. I am just putting this here in case anyone runs into this issue... I spent the better part of an hour or two getting to the bottom of this.
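(For anyone wanting to catch this pairing before an ingestion run, a small sketch based only on the versions reported above: sqllineage 1.3.x breaks with sqlparse 0.4.4, while sqlparse 0.4.3 or sqllineage 1.4.2 are fine.)

from importlib.metadata import version

# Incompatible pairing reported above: sqllineage 1.3.x monkey-patches a
# sqlparse internal that changed shape in sqlparse 0.4.4.
sqlparse_v = version("sqlparse")
sqllineage_v = version("sqllineage")
if sqllineage_v.startswith("1.3.") and sqlparse_v == "0.4.4":
    raise RuntimeError(
        f"sqllineage {sqllineage_v} is incompatible with sqlparse {sqlparse_v}; "
        "pin sqlparse==0.4.3 or upgrade sqllineage to 1.4.2"
    )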
lemon-scooter-69730
05/11/2023, 1:30 PM

damp-orange-46267
05/11/2023, 3:01 PM
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '7985f351-d346-4713-b683-f256a1b24b0d',
'infos': ['2023-05-11 14:55:13.610978 INFO: Starting execution for task with name=RUN_INGEST',
"2023-05-11 14:55:17.687276 INFO: Failed to execute 'datahub ingest'",
'2023-05-11 14:55:17.687583 INFO: Caught exception EXECUTING task_id=7985f351-d346-4713-b683-f256a1b24b0d, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv setup time = 0
This version of datahub supports report-to functionality
datahub ingest run -c /tmp/datahub/ingest/7985f351-d346-4713-b683-f256a1b24b0d/recipe.yml --report-to /tmp/datahub/ingest/7985f351-d346-4713-b683-f256a1b24b0d/ingestion_report.json
[2023-05-11 14:55:16,653] INFO {datahub.cli.ingest_cli:165} - DataHub CLI version: 0.10.0
1 validation error for PipelineConfig
source -> sink
extra fields not permitted (type=value_error.extra)
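(For reference, the pydantic error location source -> sink means the parser found a sink key nested inside the source block, which it rejects as an extra field; this usually comes from the sink being indented under source in the recipe YAML. A sketch of the broken vs. expected shape, in the dict form that Pipeline.create accepts; the source type and config values are illustrative:)

# Broken: sink nested inside source ->
#   "source -> sink: extra fields not permitted"
broken = {
    "source": {
        "type": "postgres",
        "config": {"host_port": "localhost:5432"},
        "sink": {"type": "datahub-rest"},  # wrong level
    }
}

# Expected: source and sink are sibling top-level keys.
fixed = {
    "source": {"type": "postgres", "config": {"host_port": "localhost:5432"}},
    "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
}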
limited-forest-73733
05/11/2023, 4:13 PM

little-refrigerator-78584
05/12/2023, 1:21 PM
source:
type: glue
config:
aws_region: eu-central-1
platform: glue
extract_transforms: True
database_pattern: {'deny': ['.*']}
table_pattern: {'deny': ['.*']}
It successfully pulled 2 jobs from AWS, and Glue shows up on the home page under the Platform section.
But when I click on it, it shows No results found for "".
If I want to just pull my jobs from Glue and not the tables and databases, then won't it show them?
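(For reference, deny: ['.*'] excludes every database and table, so only the Glue jobs are ingested. A small sketch with DataHub's AllowDenyPattern helper shows how those patterns evaluate:)

from datahub.configuration.common import AllowDenyPattern

# deny: ['.*'] matches every name, so no database or table passes the
# filter; only the Glue jobs (DataFlow/DataJob entities) remain.
pattern = AllowDenyPattern(deny=[".*"])
print(pattern.allowed("any_database"))  # -> False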
bland-orange-13353
05/12/2023, 1:29 PM

purple-terabyte-64712
05/13/2023, 3:58 AM

miniature-ghost-14229
05/13/2023, 1:20 PM
Dataset query failed with error: 400 INFORMATION_SCHEMA.PARTITIONS query attempted to read too many tables. Please add more restrictive filters. Location: EU Job ID: fb2b9691-cb6bb7
I tried to filter this query and reduce the amount of data to fetch, but it looks like it didn't work.
Does DataHub parse the entire project? I need to ingest only a specific dataset, so I added a filter and included my dataset name in the allow patterns, but it seems it is not working or not being taken into consideration.
Thank you
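(A sketch of a narrowed recipe, assuming the BigQuery connector's dataset_pattern and profiling options apply here; project and dataset names are placeholders. The INFORMATION_SCHEMA.PARTITIONS lookup is reportedly tied to profiling, so disabling profiling while testing the filter may avoid the 400:)

# Sketch (dict-form recipe; names are placeholders):
recipe = {
    "source": {
        "type": "bigquery",
        "config": {
            "project_ids": ["my-gcp-project"],
            # anchor the regex so only this one dataset is ingested
            "dataset_pattern": {"allow": ["^my_dataset$"]},
            # turning profiling off avoids the PARTITIONS query while
            # verifying that the dataset filter itself works
            "profiling": {"enabled": False},
        },
    },
    "sink": {"type": "datahub-rest", "config": {"server": "http://datahub-gms:8080"}},
}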
brave-room-48783
05/14/2023, 9:00 AM
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '0de3d15c-4e8d-45bf-8877-46e9c8c66de8',
'infos': ['2023-05-14 08:45:07.246534 INFO: Starting execution for task with name=RUN_INGEST',
"2023-05-14 08:45:11.476157 INFO: Failed to execute 'datahub ingest'",
'2023-05-14 08:45:11.486188 INFO: Caught exception EXECUTING task_id=0de3d15c-4e8d-45bf-8877-46e9c8c66de8, name=RUN_INGEST, '
'stacktrace=Traceback (most recent call last):\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
' task_event_loop.run_until_complete(task_future)\n'
' File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
' return future.result()\n'
' File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
' raise TaskError("Failed to execute \'datahub ingest\'")\n'
"acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
'errors': []}
~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv setup time = 0
This version of datahub supports report-to functionality
datahub --debug ingest run -c /tmp/datahub/ingest/0de3d15c-4e8d-45bf-8877-46e9c8c66de8/recipe.yml --report-to /tmp/datahub/ingest/0de3d15c-4e8d-45bf-8877-46e9c8c66de8/ingestion_report.json
[2023-05-14 08:45:08,814] DEBUG {datahub.telemetry.telemetry:219} - Sending init Telemetry
[2023-05-14 08:45:10,004] DEBUG {datahub.telemetry.telemetry:248} - Sending telemetry for function-call
[2023-05-14 08:45:10,417] INFO {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.2
[2023-05-14 08:45:10,582] DEBUG {datahub.ingestion.sink.datahub_rest:116} - Setting env variables to override config
[2023-05-14 08:45:10,582] DEBUG {datahub.ingestion.sink.datahub_rest:118} - Setting gms config
[2023-05-14 08:45:10,583] DEBUG {datahub.ingestion.run.pipeline:203} - Sink type datahub-rest (<class 'datahub.ingestion.sink.datahub_rest.DatahubRestSink'>) configured
[2023-05-14 08:45:10,583] INFO {datahub.ingestion.run.pipeline:204} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
[2023-05-14 08:45:10,595] DEBUG {datahub.ingestion.run.pipeline:278} - Reporter type:file,<class 'datahub.ingestion.reporting.file_reporter.FileReporter'> configured.
[2023-05-14 08:45:10,630] DEBUG {datahub.telemetry.telemetry:248} - Sending telemetry for function-call
[2023-05-14 08:45:11,034] ERROR {datahub.entrypoints:195} - Command failed: Failed to find a registered source for type metabase: 'str' object is not callable
Traceback (most recent call last):
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 119, in _add_init_error_context
yield
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 214, in __init__
source_class = source_registry.get(source_type)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 173, in get
tp = self._ensure_not_lazy(key)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 117, in _ensure_not_lazy
plugin_class = import_path(path)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 48, in import_path
item = importlib.import_module(module_name)
File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/source/metabase.py", line 10, in <module>
from sqllineage.runner import LineageRunner
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/sqllineage/__init__.py", line 41, in <module>
_monkey_patch()
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/sqllineage/__init__.py", line 35, in _monkey_patch
_patch_updating_lateral_view_lexeme()
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/sqllineage/__init__.py", line 24, in _patch_updating_lateral_view_lexeme
if regex("LATERAL VIEW EXPLODE(col)"):
TypeError: 'str' object is not callable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/entrypoints.py", line 182, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
raise e
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
res = func(*args, **kwargs)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
return func(ctx, *args, **kwargs)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 187, in run
pipeline = Pipeline.create(
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 328, in create
return cls(
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 211, in __init__
with _add_init_error_context(
File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
self.gen.throw(typ, value, traceback)
File "/tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 121, in _add_init_error_context
raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type metabase: 'str' object is not callable
[2023-05-14 08:45:11,040] DEBUG {datahub.entrypoints:197} - DataHub CLI version: 0.10.2 at /tmp/datahub/ingest/venv-metabase-0.10.2/lib/python3.10/site-packages/datahub/__init__.py
[2023-05-14 08:45:11,040] DEBUG {datahub.entrypoints:200} - Python version: 3.10.10 (main, Mar 14 2023, 03:08:22) [GCC 10.2.1 20210110] at /tmp/datahub/ingest/venv-metabase-0.10.2/bin/python3 on Linux-5.15.49-linuxkit-aarch64-with-glibc2.31
[2023-05-14 08:45:11,040] DEBUG {datahub.entrypoints:205} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'linkedin/datahub': {'version': 'v0.10.2', 'commit': '0fa983adc7370862371b4c0786aac0e3b81a563a'}}, 'managedIngestion': {'defaultCliVersion': '0.10.2', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'timeZone': 'GMT', 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
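(This looks like the same sqlparse 0.4.4 vs. sqllineage 1.3.x clash described earlier in the thread: the TypeError fires inside sqllineage's import-time monkey patch, before any Metabase source code runs, which is why it surfaces as "Failed to find a registered source for type metabase". A minimal standalone reproduction, assuming that version pairing is installed:)

# With sqllineage 1.3.x and sqlparse 0.4.4 installed, merely importing
# sqllineage runs its monkey patch and raises the TypeError from this log.
import importlib

try:
    importlib.import_module("sqllineage")
except TypeError as e:
    print(e)  # 'str' object is not callable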
wide-ghost-47822
05/14/2023, 8:33 PM
I was looking at the log_ingestion_stats method in the pipeline object, and I wondered if I can get some metrics about the pipeline that was run.
I saw a code block inside this method which sends some statistics data using the telemetry object. It is like this:
telemetry.telemetry_instance.ping(
"ingest_stats",
{
"source_type": self.config.source.type,
"sink_type": self.config.sink.type,
"records_written": stats.discretize(
self.sink.get_report().total_records_written
),
"source_failures": stats.discretize(source_failures),
"source_warnings": stats.discretize(source_warnings),
"sink_failures": stats.discretize(sink_failures),
"sink_warnings": stats.discretize(sink_warnings),
"global_warnings": global_warnings,
"failures": stats.discretize(source_failures + sink_failures),
"warnings": stats.discretize(
source_warnings + sink_warnings + global_warnings
),
},
)
Inside the ping method, the code sends this data to an external API called Mixpanel. It seems you are collecting data about the pipeline from my machine.
I don’t like this way of collecting data. Why are you collecting this data?
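(For what it's worth, the CLI's telemetry can be switched off; both of these opt-out switches are documented:)

# Opting out of DataHub CLI telemetry:
#   from a shell:  datahub telemetry disable
#   or via env:    DATAHUB_TELEMETRY_ENABLED=false
import os

os.environ["DATAHUB_TELEMETRY_ENABLED"] = "false"  # set before importing datahub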
colossal-waitress-83487
05/15/2023, 1:57 AM

clever-author-65853
05/15/2023, 1:19 PM

miniature-hair-20451
05/15/2023, 6:10 PM

silly-intern-25190
05/16/2023, 5:12 AM
{'error': 'Unable to emit metadata to DataHub GMS',
'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:400]: Cannot parse request entity\n'
'\tat com.linkedin.restli.server.RestLiServiceException.fromThrowable(RestLiServiceException.java:315)\n'
'\tat com.linkedin.restli.server.BaseRestLiServer.buildPreRoutingError(BaseRestLiServer.java:202)',
'message': 'Cannot parse request entity',
'status': 400,
'id': 'urn:li:dataset:(urn:li:dataPlatform:vertica_fresh,public.test_data1,PROD)'}},
silly-nest-50341
05/16/2023, 5:41 AM

damp-orange-46267
05/16/2023, 9:50 AM
PipelineInitError: Failed to find a registered source for type bigquery: 'str' object is not callable