# ingestion
  • e

    early-island-92859

    03/29/2023, 11:11 PM
Hello Team, I have a problem with BigQuery profiling. I am using v0.10.0. I have two service accounts: one for the extractor project and one for the BigQuery project. When I run it, I see the errors below for some tables during profiling. (Here I am replacing the project IDs with BIGQUERY_PROJECT and EXTRACTOR_PROJECT to make it easier to read.)
    Copy code
    [2023-03-21, 10:34:11 UTC] {ge_data_profiler.py:917} ERROR - Encountered exception while profiling BIGQUERY_PROJECT.dataset1.table1
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 203, in _execute
        self._query_job.result()
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1499, in result
        do_get_result()
      File "/home/airflow/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
        on_error=on_error,
      File "/home/airflow/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 190, in retry_target
        return target()
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1489, in do_get_result
        super(QueryJob, self).result(retry=retry, timeout=timeout)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/base.py", line 728, in result
        return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/api_core/future/polling.py", line 137, in result
        raise self._exception
    google.api_core.exceptions.NotFound: 404 Not found: Dataset EXTRACTOR_PROJECT:table1 was not found in location US
    
    Location: US
    Job ID: 11111111-1111-1111-1111-111111111111
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 912, in _generate_single_profile
        cursor.execute(bq_sql)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 494, in with_closed_check
        return method(self, *args, **kwargs)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 167, in execute
        formatted_operation, parameters, job_id, job_config, parameter_types
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 205, in _execute
        raise exceptions.DatabaseError(exc)
    google.cloud.bigquery.dbapi.exceptions.DatabaseError: 404 Not found: Dataset EXTRACTOR_PROJECT:table1 was not found in location US
When I look at the job ID that failed in EXTRACTOR_PROJECT, I see that there is no project ID in front of the dataset ID. So I believe BigQuery looks for the table in EXTRACTOR_PROJECT and returns 404 because that table is in BIGQUERY_PROJECT.
    Copy code
    SELECT * FROM `dataset1.table1` LIMIT 10000
For comparison, I looked at the successful job IDs and I see that the project ID is added before the dataset ID.
    Copy code
    SELECT
        *
    FROM
        `BIGQUERY_PROJECT.dataset2.table2`
    WHERE
        DATE(`date`) BETWEEN DATE('2023-03-20 00:00:00') AND DATE('2023-03-21 00:00:00')
Why is DataHub not adding BIGQUERY_PROJECT to all queries? Can someone help me resolve it?
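For reference, this behaviour can be reproduced outside DataHub: the BigQuery DB-API resolves a table reference that has no project prefix against the client's own project. A minimal sketch, assuming google-cloud-bigquery is installed and the service accounts exist; the project, dataset, and table names are placeholders:
Copy code
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

# The client is bound to the extractor project; queries are resolved (and billed) there.
client = bigquery.Client(project="EXTRACTOR_PROJECT")
cursor = dbapi.connect(client).cursor()

# Unqualified reference: BigQuery looks for `dataset1` inside EXTRACTOR_PROJECT -> 404.
cursor.execute("SELECT * FROM `dataset1.table1` LIMIT 10")

# Fully qualified reference: resolves against BIGQUERY_PROJECT regardless of the client project.
cursor.execute("SELECT * FROM `BIGQUERY_PROJECT.dataset1.table1` LIMIT 10")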
    d
    • 2
    • 7
  • w

    witty-motorcycle-52108

    03/29/2023, 11:41 PM
hey all! wondering if it's possible to "backfill" stateful ingestion, i.e. turn it on after an ingestion source has been set up and running for a while. i enabled it, but none of the entities are being removed, so i'm assuming that's because there's no state for those entities and it doesn't know it should remove them
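For context, DataHub's stale-entity removal compares each run against a saved checkpoint, so entities ingested before state was being recorded have nothing to be compared with. A minimal sketch of the relevant config block as a Python recipe fragment; the pipeline name and source type are placeholders, and remove_stale_metadata is the option name as of the 0.10.x CLI:
Copy code
# Fragment of a recipe: a stable pipeline_name plus stateful_ingestion lets later
# runs compare against a checkpoint and soft-delete entities that disappeared.
recipe_fragment = {
    "pipeline_name": "my_long_running_ingestion",  # placeholder; must stay constant across runs
    "source": {
        "type": "bigquery",  # placeholder source type
        "config": {
            "stateful_ingestion": {
                "enabled": True,
                "remove_stale_metadata": True,
            },
        },
    },
}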
    d
    • 2
    • 2
  • r

    red-plumber-64268

    03/30/2023, 7:53 AM
Hello friends! I am trying to get started with DataHub for ingesting BigQuery data, but I am getting repeated errors like the one below. The service account I am using has the Logs View Accessor and BigQuery Admin roles, which should be enough, right? Do you have any ideas as to what could be going wrong?
    Copy code
    ⏳ Pipeline running successfully so far; produced 157 events in 1 minute and 2 seconds.
    [2023-03-30 07:48:19,731] ERROR    {datahub.ingestion.source.bigquery_v2.bigquery:636} - Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 626, in _process_project
        yield from self._process_schema(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 782, in _process_schema
        yield from self._process_view(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 882, in _process_view
        yield from self.gen_view_dataset_workunits(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 951, in gen_view_dataset_workunits
        yield from self.gen_dataset_workunits(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 1007, in gen_dataset_workunits
        lastModified=TimeStamp(time=int(table.last_altered.timestamp() * 1000))
    AttributeError: 'int' object has no attribute 'timestamp'
    
    [2023-03-30 07:48:19,731] ERROR    {datahub.ingestion.source.bigquery_v2.bigquery:637} - Unable to get tables for dataset dashboards in project annotell-com, skipping. Does your service account has bigquery.tables.list, bigquery.routines.get, bigquery.routines.list permission, bigquery.tables.getData permission? The error was: 'int' object has no attribute 'timestamp'
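For what it's worth, the traceback shows gen_dataset_workunits assuming table.last_altered is a datetime, while the value coming back here is an epoch integer. A hedged illustration of the kind of defensive conversion that avoids this class of error (a sketch, not the actual DataHub fix):
Copy code
from datetime import datetime, timezone
from typing import Union

def to_epoch_millis(last_altered: Union[int, float, datetime]) -> int:
    """Normalize a 'last altered' value to epoch milliseconds.

    The metadata sometimes surfaces this as an epoch number and sometimes as a
    datetime, so handle both instead of calling .timestamp() unconditionally.
    """
    if isinstance(last_altered, datetime):
        return int(last_altered.timestamp() * 1000)
    value = float(last_altered)
    # Heuristic: values larger than ~1e12 are already in milliseconds.
    return int(value if value > 1e12 else value * 1000)

print(to_epoch_millis(datetime(2023, 3, 30, tzinfo=timezone.utc)))  # 1680134400000
print(to_epoch_millis(1680134400))                                  # 1680134400000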
    a
    a
    • 3
    • 6
  • s

    salmon-angle-92685

    03/30/2023, 10:10 AM
Hello guys, I am trying to keep track of which applications request which tables on Redshift. I was wondering whether there is a way of ingesting the different API calls made by Cube.js into DataHub. The idea is to be able to see the
Table -> View -> Application
chain in the lineage. Is there any way of doing this? Thank you so much for your help!
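There is no built-in Cube.js source, but lineage edges can be emitted programmatically. A minimal sketch using the REST emitter, in which the Cube.js view/application is modelled as a dataset on a hypothetical "cubejs" platform; all URNs and names below are assumptions, not an official convention:
Copy code
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Upstream: the physical Redshift table. Downstream: the Cube.js view/application,
# modelled here as a dataset on a made-up "cubejs" platform.
redshift_table = make_dataset_urn("redshift", "analytics.public.orders", "PROD")
cubejs_view = make_dataset_urn("cubejs", "orders_cube", "PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=redshift_table, type=DatasetLineageTypeClass.TRANSFORMED)]
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=cubejs_view, aspect=lineage))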
    a
    • 2
    • 3
  • s

    salmon-angle-92685

    03/30/2023, 10:15 AM
I am also trying to find out which tables are the most queried on the database side. I've seen that we have the number of queries and the top 5 recent queries available. Is there a way of listing more than 5 recent queries? Is there also any way of sorting tables by the most queried in DataHub? Thanks
    a
    • 2
    • 2
  • b

    boundless-nail-65912

    03/30/2023, 11:35 AM
Hello Team, does DataHub support ML models and stored procedures in any of the databases?
    a
    • 2
    • 1
  • p

    proud-dusk-671

    03/30/2023, 11:52 AM
Hello Team, can you tell me if AWS MSK is supported for ingestion by DataHub? Please also share the relevant docs.
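For what it's worth, AWS MSK is managed Kafka, so the regular kafka source generally applies; a hedged sketch of what such a recipe could look like when run programmatically. The broker endpoint, SASL settings, and schema registry URL are placeholders, not MSK-specific documentation:
Copy code
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {
                "connection": {
                    # Placeholder MSK bootstrap broker and auth settings.
                    "bootstrap": "b-1.my-msk-cluster.xxxxx.kafka.eu-west-1.amazonaws.com:9096",
                    "consumer_config": {
                        "security.protocol": "SASL_SSL",
                        "sasl.mechanism": "SCRAM-SHA-512",
                        "sasl.username": "<msk-username>",
                        "sasl.password": "<msk-password>",
                    },
                    "schema_registry_url": "http://my-schema-registry:8081",
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()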
    a
    • 2
    • 1
  • s

    salmon-angle-92685

    03/30/2023, 12:12 PM
Hello guys, I saw this article about the graph service implementation: https://datahubproject.io/docs/how/migrating-graph-service-implementation/ . But I cannot find an explanation of how to access this graph representation of DataHub data. Could you guys help me? Thanks!
    ✅ 1
    a
    • 2
    • 1
  • e

    enough-noon-12106

    03/30/2023, 12:44 PM
Does anyone know how we can add this
Last synchronized *14 hours ago*
or lastModified using the Python emitter in push-based ingestion?
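A hedged sketch of one way to do this with the Python emitter: freshness in the UI is generally driven by timestamps on what you emit, for example an Operation aspect carrying a last-updated time (the lastObserved system metadata of emitted aspects also feeds the "Last synchronized" display; exact behaviour can vary by version). The URN below is a placeholder:
Copy code
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import OperationClass, OperationTypeClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)"  # placeholder

now_ms = int(time.time() * 1000)

# Report a write operation; lastUpdatedTimestamp is what the dataset page can
# surface as its freshness / last-updated signal.
operation = OperationClass(
    timestampMillis=now_ms,
    lastUpdatedTimestamp=now_ms,
    operationType=OperationTypeClass.INSERT,
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=operation))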
    ✅ 1
    b
    • 2
    • 6
  • c

    cool-tiger-42613

    03/30/2023, 1:07 PM
Hello, for a custom data pipeline, how can stateful ingestion be enabled? What's the best way to create checkpoints? Are there some examples in Git for this?
    a
    • 2
    • 1
  • r

    rich-state-73859

    03/30/2023, 4:55 PM
I got this error when ingesting from the dbt source; could anyone help me with this?
    Copy code
    failed to write record with workunit urn:li:assertion:c3675908211ca5988d94475197414b71-assertionInfo with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?,?,?,?,?)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)', 'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?', 'status': 500, 'id': 'urn:li:assertion:c3675908211ca5988d94475197414b71'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?,?,?,?,?)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)', 'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?', 'status': 500, 'id': 'urn:li:assertion:c3675908211ca5988d94475197414b71'}
    a
    a
    • 3
    • 12
  • l

    lemon-scooter-69730

    03/31/2023, 12:17 PM
Using the Python SDK, are you able to retrieve all datasets for a given platform (e.g., get all datasets in BigQuery)?
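A minimal sketch using DataHubGraph; get_urns_by_filter exists in recent acryl-datahub releases (on older versions the search GraphQL endpoint via execute_graphql is an alternative), and the server address is a placeholder:
Copy code
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token=None))

# Iterate over all dataset URNs on a given platform.
for urn in graph.get_urns_by_filter(entity_types=["dataset"], platform="bigquery"):
    print(urn)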
    a
    • 2
    • 1
  • l

    lively-spring-5482

    03/31/2023, 2:26 PM
Hello, I’m using the csv-enricher source on DataHub v0.10.1 and it works pretty nicely. I’ve noticed that the OVERRIDE mode works as expected, replacing the original list of terms or tags with a new one. However, I don’t understand how I could actually remove all the tags/terms with the tool. Submitting an empty list doesn’t seem to help much. I tried leaving an empty field, sending an empty array, or a NULL value, to no avail. Is there something I’m missing, or is it simply impossible to reset a list of tags/terms on a dataset once it was initially populated? Thanks in advance for the info 🙂 CSV record:
    Copy code
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain 
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,prd_dwh.test_schema.h_test,PROD)",,[],,,,,,
    Exception thrown:
    Copy code
    {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
                                          'com.linkedin.common.GlossaryTerms: ERROR :: /terms/0/urn :: "Provided urn " is invalid\n'
                                          '\n'
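If the CSV route keeps rejecting empty values, one hedged alternative is to overwrite the aspects directly from Python with empty lists; a sketch (the actor URN is a placeholder, and the dataset URN is taken from the CSV record above):
Copy code
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlobalTagsClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,prd_dwh.test_schema.h_test,PROD)"

audit = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")

# Emitting empty lists replaces (and therefore clears) the existing tags and terms.
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=GlobalTagsClass(tags=[])))
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn, aspect=GlossaryTermsClass(terms=[], auditStamp=audit)
    )
)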
    ✅ 1
    a
    a
    • 3
    • 8
  • t

    tall-vr-26334

    03/31/2023, 2:48 PM
Hi guys, I'm trying to connect to S3 using the aws_role configuration properties, but I'm getting the following error. Does my source configuration look correctly formatted? Does anyone have an example of how to do this?
    Copy code
    source:
        type: s3
        config:
            platform: s3
            path_specs:
                -
                    include: 'my_bucket_name'
            aws_config:
                aws_region: my-region
                aws_role:
                    -
                        RoleArn: 'arn'
                        ExternalId: <external_id>
    Here is the error I'm getting
    Copy code
    ~~~~ Execution Summary - RUN_INGEST ~~~~
    Execution finished with errors.
    {'exec_id': '3ad02c48-58b6-4ff9-9a23-a7ea2268f308',
     'infos': ['2023-03-31 14:42:26.606317 INFO: Starting execution for task with name=RUN_INGEST',
               "2023-03-31 14:42:34.708565 INFO: Failed to execute 'datahub ingest'",
               '2023-03-31 14:42:34.708757 INFO: Caught exception EXECUTING task_id=3ad02c48-58b6-4ff9-9a23-a7ea2268f308, name=RUN_INGEST, '
               'stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
     'errors': []}
    
    ~~~~ Ingestion Report ~~~~
    {
      "cli": {
        "cli_version": "0.10.1",
        "cli_entry_location": "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/__init__.py",
        "py_version": "3.10.10 (main, Mar 14 2023, 02:37:11) [GCC 10.2.1 20210110]",
        "py_exec_path": "/tmp/datahub/ingest/venv-s3-0.10.1/bin/python3",
        "os_details": "Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31",
        "peak_memory_usage": "167.37 MB",
        "mem_info": "167.37 MB"
      },
      "source": {
        "type": "s3",
        "report": {
          "events_produced": 0,
          "events_produced_per_sec": 0,
          "entities": {},
          "aspects": {},
          "warnings": {},
          "failures": {},
          "filtered": [],
          "start_time": "2023-03-31 14:42:29.926640 (2.33 seconds ago)",
          "running_time": "2.33 seconds"
        }
      },
      "sink": {
        "type": "datahub-rest",
        "report": {
          "total_records_written": 0,
          "records_written_per_second": 0,
          "warnings": [],
          "failures": [],
          "start_time": "2023-03-31 14:42:29.031984 (3.22 seconds ago)",
          "current_time": "2023-03-31 14:42:32.252517 (now)",
          "total_duration_in_seconds": 3.22,
          "gms_version": "v0.10.1",
          "pending_requests": 0
        }
      }
    }
    
    ~~~~ Ingestion Logs ~~~~
    Obtaining venv creation lock...
    Acquired venv creation lock
    venv setup time = 0
    This version of datahub supports report-to functionality
    datahub  ingest run -c /tmp/datahub/ingest/3ad02c48-58b6-4ff9-9a23-a7ea2268f308/recipe.yml --report-to /tmp/datahub/ingest/3ad02c48-58b6-4ff9-9a23-a7ea2268f308/ingestion_report.json
    [2023-03-31 14:42:28,995] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.1
    [2023-03-31 14:42:29,035] INFO     {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
    [2023-03-31 14:42:29,599] ERROR    {logger:26} - Please set env variable SPARK_VERSION
    [2023-03-31 14:42:29,600] INFO     {logger:27} - Using deequ: com.amazon.deequ:deequ:1.2.2-spark-3.0
    [2023-03-31 14:42:30,227] INFO     {datahub.ingestion.run.pipeline:201} - Source configured successfully.
    [2023-03-31 14:42:30,230] INFO     {datahub.cli.ingest_cli:129} - Starting metadata ingestion
    [2023-03-31 14:42:32,253] INFO     {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/3ad02c48-58b6-4ff9-9a23-a7ea2268f308/ingestion_report.json' mode='w' encoding='UTF-8'>
    [2023-03-31 14:42:32,253] INFO     {datahub.cli.ingest_cli:134} - Source (s3) report:
    {'events_produced': 0,
     'events_produced_per_sec': 0,
     'entities': {},
     'aspects': {},
     'warnings': {},
     'failures': {},
     'filtered': [],
     'start_time': '2023-03-31 14:42:29.926640 (2.33 seconds ago)',
     'running_time': '2.33 seconds'}
    [2023-03-31 14:42:32,254] INFO     {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
    {'total_records_written': 0,
     'records_written_per_second': 0,
     'warnings': [],
     'failures': [],
     'start_time': '2023-03-31 14:42:29.031984 (3.22 seconds ago)',
     'current_time': '2023-03-31 14:42:32.253876 (now)',
     'total_duration_in_seconds': 3.22,
     'gms_version': 'v0.10.1',
     'pending_requests': 0}
    [2023-03-31 14:42:32,543] ERROR    {datahub.entrypoints:192} - Command failed: 'NoneType' object has no attribute 'access_key'
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/entrypoints.py", line 179, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
        raise e
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
        res = func(*args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
        return func(ctx, *args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
        loop.run_until_complete(run_func_check_upgrade(pipeline))
      File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
        ret = await the_one_future
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
        return await loop.run_in_executor(
      File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
        raise e
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
        pipeline.run()
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
        for wu in itertools.islice(
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 744, in get_workunits
        for file, timestamp, size in file_browser:
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 656, in s3_browser
        s3 = self.source_config.aws_config.get_s3_resource(
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/aws/aws_common.py", line 183, in get_s3_resource
        resource = self.get_session().resource(
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/aws/aws_common.py", line 139, in get_session
        "AccessKeyId": current_credentials.access_key,
    AttributeError: 'NoneType' object has no attribute 'access_key'
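For reference, the final AttributeError suggests boto3 found no base credentials on the executor to assume the role with. A hedged sketch of the same aws_config with explicit base credentials added; the field names follow the s3 source's aws_config and all values are placeholders (an attached instance/task role or a profile would work instead of static keys):
Copy code
# Fragment of the s3 recipe's aws_config: a role can only be assumed if boto3
# has base credentials (keys, a profile, or an instance/task role) to start from.
aws_config = {
    "aws_region": "my-region",
    "aws_access_key_id": "<base-access-key>",
    "aws_secret_access_key": "<base-secret-key>",
    "aws_role": [
        {"RoleArn": "arn:aws:iam::123456789012:role/my-role", "ExternalId": "<external_id>"}
    ],
}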
    a
    • 2
    • 3
  • s

    salmon-angle-92685

    03/31/2023, 3:28 PM
Hello, is there a way to ingest the application endpoint from which the Redshift tables are requested? I already have all the Redshift tables in DataHub, but I would love to have the application name in the lineage as well. Thanks!
    a
    • 2
    • 1
  • l

    loud-application-42754

    03/31/2023, 5:49 PM
Hello all! Trying to ingest custom data formats instead of the standard ones [csv, tsv, etc.] when ingesting from S3. Is there a way to add a new file format with a schema we define?
    a
    • 2
    • 3
  • l

    lively-dusk-19162

    04/01/2023, 7:26 AM
Hello all, I have created a new entity and ingested some data for it. The aspects have been ingested into the database, but I am unable to view the entities in the UI. I have updated the GraphQL code too; my browse path is returning null. Can anyone please help me resolve this error?
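A hedged sketch of emitting a browse path from Python, since a null browse path usually keeps an entity out of the Browse tree; the URN and path are placeholders, and a fully custom entity type also needs matching GraphQL/browse support in the UI:
Copy code
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import BrowsePathsClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
entity_urn = "urn:li:dataset:(urn:li:dataPlatform:myplatform,my_db.my_table,PROD)"  # placeholder

# A non-empty browse path lets the entity show up under Browse in the UI.
browse_paths = BrowsePathsClass(paths=["/prod/myplatform/my_db"])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=entity_urn, aspect=browse_paths))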
    a
    a
    h
    • 4
    • 21
  • m

    mysterious-table-75773

    04/02/2023, 8:54 PM
    Hey, is there an option to run ingestion via API and not just the RUN button in the GUI and the cron scheduler?
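One hedged option besides the RUN button is to execute a recipe programmatically with the SDK's Pipeline API, which can be called from any scheduler or service of your choice (UI-managed sources can also be triggered through DataHub's GraphQL API, though the exact mutation is version-dependent). The source config below is a placeholder:
Copy code
from datahub.ingestion.run.pipeline import Pipeline

# Run an ingestion recipe from code (cron job, Airflow task, CI step, ...) instead of the UI.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {"host_port": "localhost:5432", "username": "postgres"},  # placeholders
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.raise_from_status()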
    ✅ 1
    s
    m
    • 3
    • 16
  • b

    bitter-evening-61050

    04/03/2023, 6:17 AM
Hi Team, I am trying to enable SSO login on the DataHub web page using Azure AD. Can anyone please help me with this?
    a
    • 2
    • 1
  • l

    lively-raincoat-33818

    04/03/2023, 8:11 AM
Hello everyone, I am ingesting 'dbt source freshness' with the 'sources.json' file in DataHub. I have uploaded the file to S3 and I have modified the ingestion in DataHub, but it is not showing the data correctly: the 'Last observed' says it was 15 hours ago, which coincides with the hour of the last DataHub ingestion. Now that I'm loading the 'sources.json' it should show me the freshness data. Has anyone had this problem and can help me? I am on v0.10.0. Thanks
    ✅ 1
    a
    • 2
    • 2
  • g

    gifted-diamond-19544

    04/03/2023, 11:13 AM
Hello. I have a question about ingestion. We currently have DataHub deployed and ingesting metadata from AWS Glue (among others). Ingestion was set up via the UI. My question is: if one database is deleted from Glue, will the ingestion process delete this database’s metadata from DataHub, or will the database still exist in DataHub so that we need to delete it manually? Currently using v0.10.0. Thank you
    ✅ 3
    g
    • 2
    • 2
  • l

    lively-night-46534

    04/03/2023, 11:23 AM
Hey, is column-level lineage supported for BigQuery yet?
    a
    p
    • 3
    • 2
  • g

    gray-angle-76914

    04/03/2023, 1:36 PM
hello! I am trying to ingest from Snowflake using key-pair authentication, but I get this error:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '6619ef27-9cba-4790-ac3f-538f8a4c3c08',
     'infos': ['2023-04-03 11:07:52.379075 [exec_id=6619ef27-9cba-4790-ac3f-538f8a4c3c08] INFO: Starting execution for task with name=RUN_INGEST',
               '2023-04-03 11:07:52.390782 [exec_id=6619ef27-9cba-4790-ac3f-538f8a4c3c08] INFO: Caught exception EXECUTING '
               'task_id=6619ef27-9cba-4790-ac3f-538f8a4c3c08, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 70, in execute\n'
               '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_task_common.py", line 107, in _resolve_recipe\n'
               '    json_recipe = json.loads(resolved_recipe)\n'
               '  File "/usr/local/lib/python3.10/json/__init__.py", line 346, in loads\n'
               '    return _default_decoder.decode(s)\n'
               '  File "/usr/local/lib/python3.10/json/decoder.py", line 337, in decode\n'
               '    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n'
               '  File "/usr/local/lib/python3.10/json/decoder.py", line 353, in raw_decode\n'
               '    obj, end = self.scan_once(s, idx)\n'
               'json.decoder.JSONDecodeError: Invalid control character at: line 1 column 1027 (char 1026)\n']}
    Execution finished with errors.
    I am using the following recipe:
    Copy code
    source:
        type: snowflake
        config:
            stateful_ingestion:
                enabled: false
            env: DEV
            platform_instance: <platform>
            authentication_type: KEY_PAIR_AUTHENTICATOR
            private_key_password: '${snf_dev_pkp}'
            private_key: '${snf_dev_pk_str}'
            convert_urns_to_lowercase: true
            include_external_url: true
            database_pattern:
                ignoreCase: true
            include_technical_schema: true
            include_tables: false
            include_table_lineage: false
            include_table_location_lineage: true
            include_views: false
            include_view_linage: true
            include_column_lineage: true
            ignore_start_time_lineage: true
            include_usage_stats: true
            store_last_usage_extraction_timestamp: true
            top_n_queries: 10
            include_top_n_queries: true
            format_sql_queries: true
            include_operational_stats: true
            include_read_operational_stats: false
            apply_view_usage_to_tables: true
            email_domain: none
            store_last_profiling_timestamps: true
            profiling:
                enabled: false
                turn_off_expensive_profiling_metrics: false
                profile_table_level_only: true
                include_field_null_count: true
                include_field_distinct_count: true
                include_field_min_value: true
                include_field_max_value: true
                include_field_mean_value: true
                include_field_median_value: true
                include_field_stddev_value: true
                include_field_quantiles: true
                include_field_distinct_value_frequencies: true
                include_field_histogram: true
                include_field_sample_values: true
                field_sample_values_limit: 4
                query_combiner_enabled: true
    sink:
        type: datahub-rest
        config:
            server: <server>
where private_key_password and private_key have been added as secrets. Any idea what the error could be? Thanks!
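The JSONDecodeError ('Invalid control character at ... char 1026') is typically what appears when a secret contains raw newlines, which a multi-line PEM private key does. A hedged sketch of producing a single-line value with escaped newlines before storing it as the secret; whether the source needs the escaped or the raw form may depend on the version, but raw control characters in the substituted recipe are what break the JSON parsing here:
Copy code
# Read the multi-line PEM key and print a single-line version with literal "\n"
# escapes, which survives substitution into the JSON-encoded recipe.
with open("rsa_key.p8") as f:  # placeholder path to the Snowflake private key
    pem = f.read()

single_line = pem.strip().replace("\n", "\\n")
print(single_line)  # store this as the snf_dev_pk_str secret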
    a
    • 2
    • 1
  • c

    colossal-finland-28298

    04/04/2023, 4:23 AM
Hello, Team! 👋🏻 I’d like to ingest from a PostgreSQL database into my own local (Mac) DataHub running on Docker, with “profiling” options. I just want to gather a few (not all) “Sample Values” ONLY, without any stats, not even the distinct counts of columns and rows. I have version 0.10.1 of acryl-datahub, and the recipe I used is below.
    Copy code
    source:
      type: postgres
      config:
        host_port: localhost:5432
        username: postgres
        # database:
        # password:
        include_tables: true
        include_views: true
        schema_pattern:
            deny:
                - information_schema
                - pg_catalog
        profiling:
            enabled: true
            include_field_distinct_count: false
            include_field_min_value: false
            include_field_median_value: false
            include_field_max_value: false
            include_field_mean_value: false       
            include_field_stddev_value: false 
            partition_profiling_enabled: false
            catch_exceptions: false
            query_combiner_enabled: false
            include_field_null_count: false
            field_sample_values_limit: 10
            include_field_distinct_value_frequencies: false
            include_field_histogram: false
            include_field_quantiles: false
            query_combiner_enabled: false
            profile_table_row_count_estimate_only: true
            turn_off_expensive_profiling_metrics: true
The default value of include_field_sample_values is “true”, so I did not put that option in the recipe. I found in the ingestion log that it is still gathering the distinct count even though I turned off all profiling options except the one that collects sample values (data).
    Copy code
    2023-03-31 18:34:32,666 INFO sqlalchemy.engine.Engine [cached since 0.3957s ago] {}
    2023-03-31 18:34:32,668 INFO sqlalchemy.engine.Engine SELECT count(distinct(bid)) AS count_1 
    FROM public.pgbench_branches
    How can I turn distinct count off in profiling mode? Thank you in advance!
    a
    a
    t
    • 4
    • 7
  • c

    curved-planet-99787

    04/04/2023, 6:53 AM
Hi all, I'm having problems ingesting Tableau resources with the recent release (0.10.1). After a successful login to Tableau and the retrieval of all projects, the ingestion starts querying the Metadata API, which ends with the following error:
    Copy code
    2023-04-04 06:42:00,701 - [INFO] - [tableau.endpoint.datasources:73] - Querying all datasources on site
    2023-04-04 06:42:00,795 - [INFO] - [tableau.endpoint.metadata:61] - Querying Metadata API
    2023-04-04 06:42:00,868 - [ERROR] - [datahub.ingestion.run.pipeline:389] - Caught error
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
        for wu in itertools.islice(
      File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 90, in auto_stale_entity_removal
        for wu in stream:
      File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 41, in auto_status_aspect
        for wu in stream:
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 2247, in get_workunits_internal
        yield from self.emit_workbooks()
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 708, in emit_workbooks
        for workbook in self.get_connection_objects(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 687, in get_connection_objects
        ) = self.get_connection_object_page(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 650, in get_connection_object_page
        raise RuntimeError(f"Query {connection_type} error: {errors}")
    RuntimeError: Query workbooksConnection error: [{'message': "Validation error of type FieldUndefined: Field 'projectLuid' in type 'Workbook' is undefined @ 'workbooksConnection/nodes/projectLuid'", 'locations': [{'line': 9, 'column': 7, 'sourceName': None}], 'description': "Field 'projectLuid' in type 'Workbook' is undefined", 'validationErrorType': 'FieldUndefined', 'queryPath': ['workbooksConnection', 'nodes', 'projectLuid'], 'errorType': 'ValidationError', 'extensions': None, 'path': None}]
    Can someone help me by pointing me to the potential root cause? I can't really trace the problem here, but it looks like a DataHub internal issue to me
    plus1 1
    a
    a
    a
    • 4
    • 4
  • p

    purple-microphone-86243

    04/04/2023, 7:58 AM
Hi Team, I'm trying to create a custom ingestion source and integrate it with my local DataHub running on Docker. I've created one as per the docs here: 1. https://datahubproject.io/docs/metadata-ingestion/adding-source/ 2. https://datahubproject.io/docs/how/add-custom-ingestion-source/ 3. https://github.com/acryldata/meta-world I used the meta-world example for my ingestion, pip installed the meta-world package, and tried running the recipe.yaml file (using the datahub ingest command), but I'm getting "Failed to find a registered source: No module named my-souce". Can someone help me?
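For a custom source that is not registered as a plugin entry point, the recipe's type generally needs to be the fully qualified, importable Python path to the source class rather than a short name; a hedged sketch (the module and class names are placeholders in the meta-world style layout):
Copy code
# Recipe fragment: the source "type" must resolve to a class that is importable in the
# same environment that runs `datahub ingest`; a dashed name like "my-souce" is not importable.
source = {
    "type": "my_source.custom_ingestion_source.MySourceClass",  # placeholder dotted path
    "config": {},
}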
    g
    f
    • 3
    • 9
  • m

    most-animal-32096

    04/04/2023, 3:47 PM
Hello, I just opened issue #7748 about the discrepancy I spotted between the Python-based metadata-ingestion layer and the DataProcessInstance entity schema/aspect definitions, which leads to a 500 error on the REST POST call to datahub-gms. Feel free to comment with your opinion on any possible solution (as there are two possible sides on which to fix it).
    ✅ 1
    a
    • 2
    • 1
  • g

    green-lion-58215

    04/04/2023, 9:46 PM
Hi team, have there been any changes to how the Redshift ingestion works for package version 0.9.0? I have been running the ingestion successfully for the last few days, and from March 29 onwards some of my schema ingestions are failing due to a memory issue. It seems like the ingestion is running for a lot longer than I expected. I had this issue before, and I explicitly set “include_table_lineage”: False and it was running fine. I did not make any code changes and I am confused as to why it is failing now. I am explicitly using version 0.9.0 of the package. Any help on this is much appreciated.
    e
    a
    +4
    • 7
    • 34
  • s

    steep-waitress-15973

    04/05/2023, 3:01 AM
Hi all, has anyone had experience ingesting metadata from Microsoft Dataverse? Any recipe?
    a
    • 2
    • 1
  • a

    acceptable-midnight-32657

    04/05/2023, 8:29 AM
Hi folks! Is it possible to add a new entity (dataset or dashboard) by ingesting it from a file (JSON, YAML, CSV, etc.)? I can't find the answer in the documentation. Since there is an option to ingest sample data, there must be some API for that. Where can I find the source with that API description (and the file formats for feeding that API)?
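For reference, the sample data is loaded with the generic file source, which reads serialized MetadataChangeEvent/MCP JSON; a hedged sketch of such a recipe as a Python fragment (the filename is a placeholder, and on newer CLI versions the config key may be path instead of filename):
Copy code
# Recipe sketch: the "file" source ingests entities from a serialized MCE/MCP JSON file,
# which is the same mechanism the sample-data ingestion uses.
recipe = {
    "source": {"type": "file", "config": {"filename": "./my_entities.json"}},  # placeholder path
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}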
    ✅ 1
    m
    • 2
    • 2