# ingestion
  • e

    early-island-92859

    03/29/2023, 11:11 PM
Hello Team, I have a problem with BigQuery profiling. I am using v0.10.0. I have two service accounts: one for the extractor project and one for the BigQuery project. When I run it, I see the errors below for some tables during profiling. (Here I am replacing the project IDs with BIGQUERY_PROJECT and EXTRACTOR_PROJECT to make it easier to read.)
    Copy code
    [2023-03-21, 10:34:11 UTC] {ge_data_profiler.py:917} ERROR - Encountered exception while profiling BIGQUERY_PROJECT.dataset1.table1
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 203, in _execute
        self._query_job.result()
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1499, in result
        do_get_result()
      File "/home/airflow/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 288, in retry_wrapped_func
        on_error=on_error,
      File "/home/airflow/.local/lib/python3.7/site-packages/google/api_core/retry.py", line 190, in retry_target
        return target()
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/query.py", line 1489, in do_get_result
        super(QueryJob, self).result(retry=retry, timeout=timeout)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/job/base.py", line 728, in result
        return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/api_core/future/polling.py", line 137, in result
        raise self._exception
    google.api_core.exceptions.NotFound: 404 Not found: Dataset EXTRACTOR_PROJECT:table1 was not found in location US
    
    Location: US
    Job ID: 11111111-1111-1111-1111-111111111111
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/airflow/.local/lib/python3.7/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 912, in _generate_single_profile
        cursor.execute(bq_sql)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/_helpers.py", line 494, in with_closed_check
        return method(self, *args, **kwargs)
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 167, in execute
        formatted_operation, parameters, job_id, job_config, parameter_types
      File "/home/airflow/.local/lib/python3.7/site-packages/google/cloud/bigquery/dbapi/cursor.py", line 205, in _execute
        raise exceptions.DatabaseError(exc)
    google.cloud.bigquery.dbapi.exceptions.DatabaseError: 404 Not found: Dataset EXTRACTOR_PROJECT:table1 was not found in location US
When I look at the job ID that failed in EXTRACTOR_PROJECT, I see that there is no project ID in front of the dataset ID. So I believe BigQuery looks for the table in EXTRACTOR_PROJECT and returns 404 because that table is in BIGQUERY_PROJECT.
    Copy code
    SELECT * FROM `dataset1.table1` LIMIT 10000
For comparison, I looked at the successful job IDs and I see that the project ID is added before the dataset ID.
    Copy code
    SELECT
        *
    FROM
        `BIGQUERY_PROJECT.dataset2.table2`
    WHERE
        DATE(`date`) BETWEEN DATE('2023-03-20 00:00:00') AND DATE('2023-03-21 00:00:00')
Why is DataHub not adding BIGQUERY_PROJECT to all queries? Can someone help me resolve it?
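For reference, this behaviour can be reproduced outside DataHub: the BigQuery DB-API resolves a table reference that has no project prefix against the client's own project. A minimal sketch, assuming google-cloud-bigquery is installed and the service accounts exist; the project, dataset, and table names are placeholders:
Copy code
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

# The client is bound to the extractor project; queries are resolved (and billed) there.
client = bigquery.Client(project="EXTRACTOR_PROJECT")
cursor = dbapi.connect(client).cursor()

# Unqualified reference: BigQuery looks for `dataset1` inside EXTRACTOR_PROJECT -> 404.
cursor.execute("SELECT * FROM `dataset1.table1` LIMIT 10")

# Fully qualified reference: resolves against BIGQUERY_PROJECT regardless of the client project.
cursor.execute("SELECT * FROM `BIGQUERY_PROJECT.dataset1.table1` LIMIT 10")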
    d
    • 2
    • 7
  • w

    witty-motorcycle-52108

    03/29/2023, 11:41 PM
hey all! wondering if it's possible to "backfill" stateful ingestion, i.e. turn it on after an ingestion source has been set up and running for a while. i enabled it, but none of the entities are being removed, so i'm assuming that's because there's no state for those entities and it doesn't know it should remove them
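For context, DataHub's stale-entity removal compares each run against a saved checkpoint, so entities ingested before state was being recorded have nothing to be compared with. A minimal sketch of the relevant config block as a Python recipe fragment; the pipeline name and source type are placeholders, and remove_stale_metadata is the option name as of the 0.10.x CLI:
Copy code
# Fragment of a recipe: a stable pipeline_name plus stateful_ingestion lets later
# runs compare against a checkpoint and soft-delete entities that disappeared.
recipe_fragment = {
    "pipeline_name": "my_long_running_ingestion",  # placeholder; must stay constant across runs
    "source": {
        "type": "bigquery",  # placeholder source type
        "config": {
            "stateful_ingestion": {
                "enabled": True,
                "remove_stale_metadata": True,
            },
        },
    },
}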
    d
    • 2
    • 2
  • r

    red-plumber-64268

    03/30/2023, 7:53 AM
Hello friends! I am trying to get started with DataHub for ingesting BigQuery data, but I am getting repeated errors like the one below. The service account I am using has the Logs View Accessor and BigQuery Admin roles, which should be enough, right? Do you have any ideas as to what could be going wrong?
    Copy code
    ⏳ Pipeline running successfully so far; produced 157 events in 1 minute and 2 seconds.
    [2023-03-30 07:48:19,731] ERROR    {datahub.ingestion.source.bigquery_v2.bigquery:636} - Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 626, in _process_project
        yield from self._process_schema(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 782, in _process_schema
        yield from self._process_view(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 882, in _process_view
        yield from self.gen_view_dataset_workunits(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 951, in gen_view_dataset_workunits
        yield from self.gen_dataset_workunits(
      File "/tmp/datahub/ingest/venv-bigquery-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/bigquery_v2/bigquery.py", line 1007, in gen_dataset_workunits
        lastModified=TimeStamp(time=int(table.last_altered.timestamp() * 1000))
    AttributeError: 'int' object has no attribute 'timestamp'
    
    [2023-03-30 07:48:19,731] ERROR    {datahub.ingestion.source.bigquery_v2.bigquery:637} - Unable to get tables for dataset dashboards in project annotell-com, skipping. Does your service account has bigquery.tables.list, bigquery.routines.get, bigquery.routines.list permission, bigquery.tables.getData permission? The error was: 'int' object has no attribute 'timestamp'
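For what it's worth, the traceback shows gen_dataset_workunits assuming table.last_altered is a datetime, while the value coming back here is an epoch integer. A hedged illustration of the kind of defensive conversion that avoids this class of error (a sketch, not the actual DataHub fix):
Copy code
from datetime import datetime, timezone
from typing import Union

def to_epoch_millis(last_altered: Union[int, float, datetime]) -> int:
    """Normalize a 'last altered' value to epoch milliseconds.

    The metadata sometimes surfaces this as an epoch number and sometimes as a
    datetime, so handle both instead of calling .timestamp() unconditionally.
    """
    if isinstance(last_altered, datetime):
        return int(last_altered.timestamp() * 1000)
    value = float(last_altered)
    # Heuristic: values larger than ~1e12 are already in milliseconds.
    return int(value if value > 1e12 else value * 1000)

print(to_epoch_millis(datetime(2023, 3, 30, tzinfo=timezone.utc)))  # 1680134400000
print(to_epoch_millis(1680134400))                                  # 1680134400000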
    a
    a
    • 3
    • 6
  • s

    salmon-angle-92685

    03/30/2023, 10:10 AM
Hello guys, I am trying to keep track of which applications request which tables on Redshift. I was wondering whether there is a way of ingesting the different API calls made by Cube.js into DataHub. The idea is to be able to see the
Table -> View -> Application
chain in the lineage. Is there any way of doing this? Thank you so much for your help!
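There is no built-in Cube.js source, but lineage edges can be emitted programmatically. A minimal sketch using the REST emitter, in which the Cube.js view/application is modelled as a dataset on a hypothetical "cubejs" platform; all URNs and names below are assumptions, not an official convention:
Copy code
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    DatasetLineageTypeClass,
    UpstreamClass,
    UpstreamLineageClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Upstream: the physical Redshift table. Downstream: the Cube.js view/application,
# modelled here as a dataset on a made-up "cubejs" platform.
redshift_table = make_dataset_urn("redshift", "analytics.public.orders", "PROD")
cubejs_view = make_dataset_urn("cubejs", "orders_cube", "PROD")

lineage = UpstreamLineageClass(
    upstreams=[UpstreamClass(dataset=redshift_table, type=DatasetLineageTypeClass.TRANSFORMED)]
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=cubejs_view, aspect=lineage))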
    a
    • 2
    • 3
  • s

    salmon-angle-92685

    03/30/2023, 10:15 AM
I am also trying to find out which tables are the most queried on the database side. I've seen that we have the number of queries and the top 5 recent queries available. Is there a way of listing more than 5 recent queries? Is there also any way of sorting tables by the most queried in DataHub? Thanks
    a
    • 2
    • 2
  • b

    boundless-nail-65912

    03/30/2023, 11:35 AM
Hello Team, does DataHub support ML models and stored procedures in any of the databases?
    a
    • 2
    • 1
  • p

    proud-dusk-671

    03/30/2023, 11:52 AM
Hello Team, can you tell me if AWS MSK is supported for ingestion by DataHub? Please also share the relevant docs.
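For what it's worth, AWS MSK is managed Kafka, so the regular kafka source generally applies; a hedged sketch of what such a recipe could look like when run programmatically. The broker endpoint, SASL settings, and schema registry URL are placeholders, not MSK-specific documentation:
Copy code
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {
                "connection": {
                    # Placeholder MSK bootstrap broker and auth settings.
                    "bootstrap": "b-1.my-msk-cluster.xxxxx.kafka.eu-west-1.amazonaws.com:9096",
                    "consumer_config": {
                        "security.protocol": "SASL_SSL",
                        "sasl.mechanism": "SCRAM-SHA-512",
                        "sasl.username": "<msk-username>",
                        "sasl.password": "<msk-password>",
                    },
                    "schema_registry_url": "http://my-schema-registry:8081",
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()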
    a
    • 2
    • 1
  • s

    salmon-angle-92685

    03/30/2023, 12:12 PM
Hello guys, I saw this article about the graph service implementation: https://datahubproject.io/docs/how/migrating-graph-service-implementation/ . But I cannot find an explanation of how to access this graph representation of DataHub data. Could you guys help me? Thanks!
    ✅ 1
    a
    • 2
    • 1
  • e

    enough-noon-12106

    03/30/2023, 12:44 PM
Does anyone know how we can add this
Last synchronized *14 hours ago*
or lastModified using the Python emitter in push-based ingestion?
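A hedged sketch of one way to do this with the Python emitter: freshness in the UI is generally driven by timestamps on what you emit, for example an Operation aspect carrying a last-updated time (the lastObserved system metadata of emitted aspects also feeds the "Last synchronized" display; exact behaviour can vary by version). The URN below is a placeholder:
Copy code
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import OperationClass, OperationTypeClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)"  # placeholder

now_ms = int(time.time() * 1000)

# Report a write operation; lastUpdatedTimestamp is what the dataset page can
# surface as its freshness / last-updated signal.
operation = OperationClass(
    timestampMillis=now_ms,
    lastUpdatedTimestamp=now_ms,
    operationType=OperationTypeClass.INSERT,
)
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=operation))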
    ✅ 1
    b
    • 2
    • 6
  • c

    cool-tiger-42613

    03/30/2023, 1:07 PM
Hello, for a custom data pipeline, how can stateful ingestion be enabled? What's the best way to create checkpoints? Are there some examples in Git for this?
    a
    • 2
    • 1
  • r

    rich-state-73859

    03/30/2023, 4:55 PM
I got this error when ingesting from the dbt source; could anyone help me with this?
    Copy code
    failed to write record with workunit urn:li:assertion:c3675908211ca5988d94475197414b71-assertionInfo with ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?,?,?,?,?)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)', 'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?', 'status': 500, 'id': 'urn:li:assertion:c3675908211ca5988d94475197414b71'}) and info {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?,?,?,?,?)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)\n\tat com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)', 'message': 'javax.persistence.PersistenceException: Error when batch flush on sql: insert into metadata_aspect_v2 (urn, aspect, version, metadata, createdOn, createdBy, createdFor, systemmetadata) values (?,?,?,?', 'status': 500, 'id': 'urn:li:assertion:c3675908211ca5988d94475197414b71'}
    a
    a
    • 3
    • 12
  • l

    lemon-scooter-69730

    03/31/2023, 12:17 PM
Using the Python SDK, are you able to retrieve all datasets for a given platform (e.g., get all datasets in BigQuery)?
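A minimal sketch using DataHubGraph; get_urns_by_filter exists in recent acryl-datahub releases (on older versions the search GraphQL endpoint via execute_graphql is an alternative), and the server address is a placeholder:
Copy code
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080", token=None))

# Iterate over all dataset URNs on a given platform.
for urn in graph.get_urns_by_filter(entity_types=["dataset"], platform="bigquery"):
    print(urn)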
    a
    • 2
    • 1
  • l

    lively-spring-5482

    03/31/2023, 2:26 PM
Hello, I’m using the csv-enricher source on DataHub v0.10.1 and it works pretty nicely. I’ve noticed that the OVERRIDE mode works as expected, replacing the original list of terms or tags with a new one. However, I don’t understand how I could actually remove all the tags/terms with the tool. Submitting an empty list doesn’t seem to help much. I tried leaving an empty field, sending an empty array, or a NULL value, to no avail. Is there something I’m missing, or is it simply impossible to reset a list of tags/terms on a dataset once it was initially populated? Thanks in advance for the info 🙂 CSV record:
    Copy code
    resource,subresource,glossary_terms,tags,owners,ownership_type,description,domain 
    "urn:li:dataset:(urn:li:dataPlatform:snowflake,prd_dwh.test_schema.h_test,PROD)",,[],,,,,,
    Exception thrown:
    Copy code
    {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
                            'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:422]: Failed to validate record with class '
                                          'com.linkedin.common.GlossaryTerms: ERROR :: /terms/0/urn :: "Provided urn " is invalid\n'
                                          '\n'
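If the CSV route keeps rejecting empty values, one hedged alternative is to overwrite the aspects directly from Python with empty lists; a sketch (the actor URN is a placeholder, and the dataset URN is taken from the CSV record above):
Copy code
import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlobalTagsClass,
    GlossaryTermsClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,prd_dwh.test_schema.h_test,PROD)"

audit = AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub")

# Emitting empty lists replaces (and therefore clears) the existing tags and terms.
emitter.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=GlobalTagsClass(tags=[])))
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn, aspect=GlossaryTermsClass(terms=[], auditStamp=audit)
    )
)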
    ✅ 1
    a
    a
    • 3
    • 8
  • t

    tall-vr-26334

    03/31/2023, 2:48 PM
Hi guys, I'm trying to connect to S3 using the aws_role configuration properties, but I'm getting the following error. Does my source configuration look correctly formatted? Does anyone have an example of how to do this?
    Copy code
    source:
        type: s3
        config:
            platform: s3
            path_specs:
                -
                    include: 'my_bucket_name'
            aws_config:
                aws_region: my-region
                aws_role:
                    -
                        RoleArn: 'arn'
                        ExternalId: <external_id>
    Here is the error I'm getting
    Copy code
    ~~~~ Execution Summary - RUN_INGEST ~~~~
    Execution finished with errors.
    {'exec_id': '3ad02c48-58b6-4ff9-9a23-a7ea2268f308',
     'infos': ['2023-03-31 14:42:26.606317 INFO: Starting execution for task with name=RUN_INGEST',
               "2023-03-31 14:42:34.708565 INFO: Failed to execute 'datahub ingest'",
               '2023-03-31 14:42:34.708757 INFO: Caught exception EXECUTING task_id=3ad02c48-58b6-4ff9-9a23-a7ea2268f308, name=RUN_INGEST, '
               'stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 122, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 231, in execute\n'
               '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
               "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
     'errors': []}
    
    ~~~~ Ingestion Report ~~~~
    {
      "cli": {
        "cli_version": "0.10.1",
        "cli_entry_location": "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/__init__.py",
        "py_version": "3.10.10 (main, Mar 14 2023, 02:37:11) [GCC 10.2.1 20210110]",
        "py_exec_path": "/tmp/datahub/ingest/venv-s3-0.10.1/bin/python3",
        "os_details": "Linux-5.10.102.1-microsoft-standard-WSL2-x86_64-with-glibc2.31",
        "peak_memory_usage": "167.37 MB",
        "mem_info": "167.37 MB"
      },
      "source": {
        "type": "s3",
        "report": {
          "events_produced": 0,
          "events_produced_per_sec": 0,
          "entities": {},
          "aspects": {},
          "warnings": {},
          "failures": {},
          "filtered": [],
          "start_time": "2023-03-31 14:42:29.926640 (2.33 seconds ago)",
          "running_time": "2.33 seconds"
        }
      },
      "sink": {
        "type": "datahub-rest",
        "report": {
          "total_records_written": 0,
          "records_written_per_second": 0,
          "warnings": [],
          "failures": [],
          "start_time": "2023-03-31 14:42:29.031984 (3.22 seconds ago)",
          "current_time": "2023-03-31 14:42:32.252517 (now)",
          "total_duration_in_seconds": 3.22,
          "gms_version": "v0.10.1",
          "pending_requests": 0
        }
      }
    }
    
    ~~~~ Ingestion Logs ~~~~
    Obtaining venv creation lock...
    Acquired venv creation lock
    venv setup time = 0
    This version of datahub supports report-to functionality
    datahub  ingest run -c /tmp/datahub/ingest/3ad02c48-58b6-4ff9-9a23-a7ea2268f308/recipe.yml --report-to /tmp/datahub/ingest/3ad02c48-58b6-4ff9-9a23-a7ea2268f308/ingestion_report.json
    [2023-03-31 14:42:28,995] INFO     {datahub.cli.ingest_cli:173} - DataHub CLI version: 0.10.1
    [2023-03-31 14:42:29,035] INFO     {datahub.ingestion.run.pipeline:184} - Sink configured successfully. DataHubRestEmitter: configured to talk to <http://datahub-gms:8080>
    [2023-03-31 14:42:29,599] ERROR    {logger:26} - Please set env variable SPARK_VERSION
    [2023-03-31 14:42:29,600] INFO     {logger:27} - Using deequ: com.amazon.deequ:deequ:1.2.2-spark-3.0
    [2023-03-31 14:42:30,227] INFO     {datahub.ingestion.run.pipeline:201} - Source configured successfully.
    [2023-03-31 14:42:30,230] INFO     {datahub.cli.ingest_cli:129} - Starting metadata ingestion
    [2023-03-31 14:42:32,253] INFO     {datahub.ingestion.reporting.file_reporter:52} - Wrote UNKNOWN report successfully to <_io.TextIOWrapper name='/tmp/datahub/ingest/3ad02c48-58b6-4ff9-9a23-a7ea2268f308/ingestion_report.json' mode='w' encoding='UTF-8'>
    [2023-03-31 14:42:32,253] INFO     {datahub.cli.ingest_cli:134} - Source (s3) report:
    {'events_produced': 0,
     'events_produced_per_sec': 0,
     'entities': {},
     'aspects': {},
     'warnings': {},
     'failures': {},
     'filtered': [],
     'start_time': '2023-03-31 14:42:29.926640 (2.33 seconds ago)',
     'running_time': '2.33 seconds'}
    [2023-03-31 14:42:32,254] INFO     {datahub.cli.ingest_cli:137} - Sink (datahub-rest) report:
    {'total_records_written': 0,
     'records_written_per_second': 0,
     'warnings': [],
     'failures': [],
     'start_time': '2023-03-31 14:42:29.031984 (3.22 seconds ago)',
     'current_time': '2023-03-31 14:42:32.253876 (now)',
     'total_duration_in_seconds': 3.22,
     'gms_version': 'v0.10.1',
     'pending_requests': 0}
    [2023-03-31 14:42:32,543] ERROR    {datahub.entrypoints:192} - Command failed: 'NoneType' object has no attribute 'access_key'
    Traceback (most recent call last):
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/entrypoints.py", line 179, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 379, in wrapper
        raise e
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 334, in wrapper
        res = func(*args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
        return func(ctx, *args, **kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
        loop.run_until_complete(run_func_check_upgrade(pipeline))
      File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
        return future.result()
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 158, in run_func_check_upgrade
        ret = await the_one_future
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 149, in run_pipeline_async
        return await loop.run_in_executor(
      File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 140, in run_pipeline_to_completion
        raise e
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 132, in run_pipeline_to_completion
        pipeline.run()
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
        for wu in itertools.islice(
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 744, in get_workunits
        for file, timestamp, size in file_browser:
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/s3/source.py", line 656, in s3_browser
        s3 = self.source_config.aws_config.get_s3_resource(
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/aws/aws_common.py", line 183, in get_s3_resource
        resource = self.get_session().resource(
      File "/tmp/datahub/ingest/venv-s3-0.10.1/lib/python3.10/site-packages/datahub/ingestion/source/aws/aws_common.py", line 139, in get_session
        "AccessKeyId": current_credentials.access_key,
    AttributeError: 'NoneType' object has no attribute 'access_key'
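For reference, the final AttributeError suggests boto3 found no base credentials on the executor to assume the role with. A hedged sketch of the same aws_config with explicit base credentials added; the field names follow the s3 source's aws_config and all values are placeholders (an attached instance/task role or a profile would work instead of static keys):
Copy code
# Fragment of the s3 recipe's aws_config: a role can only be assumed if boto3
# has base credentials (keys, a profile, or an instance/task role) to start from.
aws_config = {
    "aws_region": "my-region",
    "aws_access_key_id": "<base-access-key>",
    "aws_secret_access_key": "<base-secret-key>",
    "aws_role": [
        {"RoleArn": "arn:aws:iam::123456789012:role/my-role", "ExternalId": "<external_id>"}
    ],
}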
    a
    • 2
    • 3
  • s

    salmon-angle-92685

    03/31/2023, 3:28 PM
Hello, is there a way to ingest the application endpoint from which the Redshift tables are requested? I already have all the Redshift tables in DataHub, but I would love to have the application name in the lineage as well. Thanks!
    a
    • 2
    • 1
  • l

    loud-application-42754

    03/31/2023, 5:49 PM
Hello all! Trying to ingest custom data formats instead of the standard ones [csv, tsv, etc.] when ingesting from S3. Is there a way to add a new file format with a schema we define?
    a
    • 2
    • 3
  • l

    lively-dusk-19162

    04/01/2023, 7:26 AM
Hello all, I have created a new entity and ingested some data for it. The aspects have been ingested into the database, but I am unable to view the entities in the UI. I have updated the GraphQL code too; my browse path is returning null. Can anyone please help me resolve this error?
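A hedged sketch of emitting a browse path from Python, since a null browse path usually keeps an entity out of the Browse tree; the URN and path are placeholders, and a fully custom entity type also needs matching GraphQL/browse support in the UI:
Copy code
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import BrowsePathsClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
entity_urn = "urn:li:dataset:(urn:li:dataPlatform:myplatform,my_db.my_table,PROD)"  # placeholder

# A non-empty browse path lets the entity show up under Browse in the UI.
browse_paths = BrowsePathsClass(paths=["/prod/myplatform/my_db"])
emitter.emit(MetadataChangeProposalWrapper(entityUrn=entity_urn, aspect=browse_paths))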
    a
    a
    h
    • 4
    • 21
  • m

    mysterious-table-75773

    04/02/2023, 8:54 PM
    Hey, is there an option to run ingestion via API and not just the RUN button in the GUI and the cron scheduler?
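One hedged option besides the RUN button is to execute a recipe programmatically with the SDK's Pipeline API, which can be called from any scheduler or service of your choice (UI-managed sources can also be triggered through DataHub's GraphQL API, though the exact mutation is version-dependent). The source config below is a placeholder:
Copy code
from datahub.ingestion.run.pipeline import Pipeline

# Run an ingestion recipe from code (cron job, Airflow task, CI step, ...) instead of the UI.
pipeline = Pipeline.create(
    {
        "source": {
            "type": "postgres",
            "config": {"host_port": "localhost:5432", "username": "postgres"},  # placeholders
        },
        "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
    }
)
pipeline.run()
pipeline.raise_from_status()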
    ✅ 1
    s
    m
    • 3
    • 16
  • b

    bitter-evening-61050

    04/03/2023, 6:17 AM
Hi Team, I am trying to enable SSO login on the DataHub web page using Azure AD. Can anyone please help me with this?
    a
    • 2
    • 1
  • l

    lively-raincoat-33818

    04/03/2023, 8:11 AM
Hello everyone, I am ingesting 'dbt source freshness' with the 'sources.json' file in DataHub. I have uploaded the file to S3 and I have modified the ingestion in DataHub, but it is not showing the data correctly: the 'Last observed' says it was 15 hours ago, which coincides with the hour of the last DataHub ingestion. Now that I'm loading the 'sources.json' it should show me the freshness data. Has anyone had this problem and can help me? I am on v0.10.0. Thanks
    ✅ 1
    a
    • 2
    • 2
  • g

    gifted-diamond-19544

    04/03/2023, 11:13 AM
Hello. I have a question about ingestion. We currently have DataHub deployed and ingesting metadata from AWS Glue (among others). Ingestion was set up via the UI. My question is: if one database is deleted from Glue, will the ingestion process delete this database’s metadata from DataHub, or will the database still exist in DataHub so that we need to delete it manually? Currently using v0.10.0. Thank you
    ✅ 3
    g
    • 2
    • 2
  • l

    lively-night-46534

    04/03/2023, 11:23 AM
Hey, is column-level lineage supported for BigQuery yet?
    a
    p
    • 3
    • 2
  • g

    gray-angle-76914

    04/03/2023, 1:36 PM
hello! I am trying to ingest from Snowflake using key-pair authentication, but I get this error:
    Copy code
    ~~~~ Execution Summary ~~~~
    
    RUN_INGEST - {'errors': [],
     'exec_id': '6619ef27-9cba-4790-ac3f-538f8a4c3c08',
     'infos': ['2023-04-03 11:07:52.379075 [exec_id=6619ef27-9cba-4790-ac3f-538f8a4c3c08] INFO: Starting execution for task with name=RUN_INGEST',
               '2023-04-03 11:07:52.390782 [exec_id=6619ef27-9cba-4790-ac3f-538f8a4c3c08] INFO: Caught exception EXECUTING '
               'task_id=6619ef27-9cba-4790-ac3f-538f8a4c3c08, name=RUN_INGEST, stacktrace=Traceback (most recent call last):\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 123, in execute_task\n'
               '    task_event_loop.run_until_complete(task_future)\n'
               '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete\n'
               '    return future.result()\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 70, in execute\n'
               '    recipe: dict = SubProcessTaskUtil._resolve_recipe(validated_args.recipe, ctx, self.ctx)\n'
               '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_task_common.py", line 107, in _resolve_recipe\n'
               '    json_recipe = json.loads(resolved_recipe)\n'
               '  File "/usr/local/lib/python3.10/json/__init__.py", line 346, in loads\n'
               '    return _default_decoder.decode(s)\n'
               '  File "/usr/local/lib/python3.10/json/decoder.py", line 337, in decode\n'
               '    obj, end = self.raw_decode(s, idx=_w(s, 0).end())\n'
               '  File "/usr/local/lib/python3.10/json/decoder.py", line 353, in raw_decode\n'
               '    obj, end = self.scan_once(s, idx)\n'
               'json.decoder.JSONDecodeError: Invalid control character at: line 1 column 1027 (char 1026)\n']}
    Execution finished with errors.
    I am using the following recipe:
    Copy code
    source:
        type: snowflake
        config:
            stateful_ingestion:
                enabled: false
            env: DEV
            platform_instance: <platform>
            authentication_type: KEY_PAIR_AUTHENTICATOR
            private_key_password: '${snf_dev_pkp}'
            private_key: '${snf_dev_pk_str}'
            convert_urns_to_lowercase: true
            include_external_url: true
            database_pattern:
                ignoreCase: true
            include_technical_schema: true
            include_tables: false
            include_table_lineage: false
            include_table_location_lineage: true
            include_views: false
            include_view_linage: true
            include_column_lineage: true
            ignore_start_time_lineage: true
            include_usage_stats: true
            store_last_usage_extraction_timestamp: true
            top_n_queries: 10
            include_top_n_queries: true
            format_sql_queries: true
            include_operational_stats: true
            include_read_operational_stats: false
            apply_view_usage_to_tables: true
            email_domain: none
            store_last_profiling_timestamps: true
            profiling:
                enabled: false
                turn_off_expensive_profiling_metrics: false
                profile_table_level_only: true
                include_field_null_count: true
                include_field_distinct_count: true
                include_field_min_value: true
                include_field_max_value: true
                include_field_mean_value: true
                include_field_median_value: true
                include_field_stddev_value: true
                include_field_quantiles: true
                include_field_distinct_value_frequencies: true
                include_field_histogram: true
                include_field_sample_values: true
                field_sample_values_limit: 4
                query_combiner_enabled: true
    sink:
        type: datahub-rest
        config:
            server: <server>
where private_key_password and private_key have been added as secrets. Any idea what the error could be? Thanks!
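The JSONDecodeError ('Invalid control character at ... char 1026') is typically what appears when a secret contains raw newlines, which a multi-line PEM private key does. A hedged sketch of producing a single-line value with escaped newlines before storing it as the secret; whether the source needs the escaped or the raw form may depend on the version, but raw control characters in the substituted recipe are what break the JSON parsing here:
Copy code
# Read the multi-line PEM key and print a single-line version with literal "\n"
# escapes, which survives substitution into the JSON-encoded recipe.
with open("rsa_key.p8") as f:  # placeholder path to the Snowflake private key
    pem = f.read()

single_line = pem.strip().replace("\n", "\\n")
print(single_line)  # store this as the snf_dev_pk_str secret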
    a
    • 2
    • 1
  • c

    colossal-finland-28298

    04/04/2023, 4:23 AM
Hello, Team! 👋🏻 I’d like to ingest from a PostgreSQL database into my own local (Mac) DataHub running on Docker, with “profiling” options. I just want to gather a few (not all) “Sample Values” ONLY, without any stats, not even the distinct counts of columns and rows. I have version 0.10.1 of acryl-datahub, and the recipe I used is below.
    Copy code
    source:
      type: postgres
      config:
        host_port: localhost:5432
        username: postgres
        # database:
        # password:
        include_tables: true
        include_views: true
        schema_pattern:
            deny:
                - information_schema
                - pg_catalog
        profiling:
            enabled: true
            include_field_distinct_count: false
            include_field_min_value: false
            include_field_median_value: false
            include_field_max_value: false
            include_field_mean_value: false       
            include_field_stddev_value: false 
            partition_profiling_enabled: false
            catch_exceptions: false
            query_combiner_enabled: false
            include_field_null_count: false
            field_sample_values_limit: 10
            include_field_distinct_value_frequencies: false
            include_field_histogram: false
            include_field_quantiles: false
            query_combiner_enabled: false
            profile_table_row_count_estimate_only: true
            turn_off_expensive_profiling_metrics: true
The default value of include_field_sample_values is “true”, so I did not put that option in the recipe. I found in the ingestion log that it is still gathering the distinct count even though I turned off all profiling options except the one that collects sample values (data).
    Copy code
    2023-03-31 18:34:32,666 INFO sqlalchemy.engine.Engine [cached since 0.3957s ago] {}
    2023-03-31 18:34:32,668 INFO sqlalchemy.engine.Engine SELECT count(distinct(bid)) AS count_1 
    FROM public.pgbench_branches
    How can I turn distinct count off in profiling mode? Thank you in advance!
    a
    a
    t
    • 4
    • 7
  • c

    curved-planet-99787

    04/04/2023, 6:53 AM
Hi all, I'm having problems ingesting Tableau resources with the recent release (0.10.1). After a successful login to Tableau and the retrieval of all projects, the ingestion starts querying the Metadata API, which ends with the following error:
    Copy code
    2023-04-04 06:42:00,701 - [INFO] - [tableau.endpoint.datasources:73] - Querying all datasources on site
    2023-04-04 06:42:00,795 - [INFO] - [tableau.endpoint.metadata:61] - Querying Metadata API
    2023-04-04 06:42:00,868 - [ERROR] - [datahub.ingestion.run.pipeline:389] - Caught error
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 339, in run
        for wu in itertools.islice(
      File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 90, in auto_stale_entity_removal
        for wu in stream:
      File "/usr/local/lib/python3.10/site-packages/datahub/utilities/source_helpers.py", line 41, in auto_status_aspect
        for wu in stream:
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 2247, in get_workunits_internal
        yield from self.emit_workbooks()
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 708, in emit_workbooks
        for workbook in self.get_connection_objects(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 687, in get_connection_objects
        ) = self.get_connection_object_page(
      File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/source/tableau.py", line 650, in get_connection_object_page
        raise RuntimeError(f"Query {connection_type} error: {errors}")
    RuntimeError: Query workbooksConnection error: [{'message': "Validation error of type FieldUndefined: Field 'projectLuid' in type 'Workbook' is undefined @ 'workbooksConnection/nodes/projectLuid'", 'locations': [{'line': 9, 'column': 7, 'sourceName': None}], 'description': "Field 'projectLuid' in type 'Workbook' is undefined", 'validationErrorType': 'FieldUndefined', 'queryPath': ['workbooksConnection', 'nodes', 'projectLuid'], 'errorType': 'ValidationError', 'extensions': None, 'path': None}]
    Can someone help me by pointing me to the potential root cause? I can't really trace the problem here, but it looks like a DataHub internal issue to me
    plus1 1
    a
    a
    a
    • 4
    • 4
  • p

    purple-microphone-86243

    04/04/2023, 7:58 AM
Hi Team, I'm trying to create a custom ingestion source and integrate it with my local DataHub running on Docker. I've created one as per the docs here: 1. https://datahubproject.io/docs/metadata-ingestion/adding-source/ 2. https://datahubproject.io/docs/how/add-custom-ingestion-source/ 3. https://github.com/acryldata/meta-world I used the meta-world example for my ingestion, pip installed the meta-world package, and tried running the recipe.yaml file (using the datahub ingest command), but I'm getting "Failed to find a registered source: No module named my-souce". Can someone help me?
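For a custom source that is not registered as a plugin entry point, the recipe's type generally needs to be the fully qualified, importable Python path to the source class rather than a short name; a hedged sketch (the module and class names are placeholders in the meta-world style layout):
Copy code
# Recipe fragment: the source "type" must resolve to a class that is importable in the
# same environment that runs `datahub ingest`; a dashed name like "my-souce" is not importable.
source = {
    "type": "my_source.custom_ingestion_source.MySourceClass",  # placeholder dotted path
    "config": {},
}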
    g
    f
    • 3
    • 9
  • m

    most-animal-32096

    04/04/2023, 3:47 PM
Hello, I just opened issue #7748 about the discrepancy I spotted between the Python-based metadata-ingestion layer and the DataProcessInstance entity schema/aspect definitions, which leads to a 500 error on the REST POST call to datahub-gms. Feel free to comment with your opinion on any possible solution (as there are two possible sides on which to fix it).
    ✅ 1
    a
    • 2
    • 1
  • g

    green-lion-58215

    04/04/2023, 9:46 PM
Hi team, have there been any changes to how the Redshift ingestion works for package version 0.9.0? I have been running the ingestion successfully for the last few days, and from March 29 onwards some of my schema ingestions are failing due to a memory issue. It seems like the ingestion is running for a lot longer than I expected. I had this issue before, and I explicitly set “include_table_lineage”: False and it was running fine. I did not make any code changes and I am confused as to why it is failing now. I am explicitly using version 0.9.0 of the package. Any help on this is much appreciated.
    e
    a
    +4
    • 7
    • 34
  • s

    steep-waitress-15973

    04/05/2023, 3:01 AM
Hi all, has anyone had experience ingesting metadata from Microsoft Dataverse? Any recipe?
    a
    • 2
    • 1
  • a

    acceptable-midnight-32657

    04/05/2023, 8:29 AM
Hi folks! Is it possible to add a new entity (dataset or dashboard) by ingesting it from a file (JSON, YAML, CSV, etc.)? I can't find the answer in the documentation. Since there is an option to ingest sample data, there must be some API for that. Where can I find the source with that API description (and the file formats for feeding that API)?
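For reference, the sample data is loaded with the generic file source, which reads serialized MetadataChangeEvent/MCP JSON; a hedged sketch of such a recipe as a Python fragment (the filename is a placeholder, and on newer CLI versions the config key may be path instead of filename):
Copy code
# Recipe sketch: the "file" source ingests entities from a serialized MCE/MCP JSON file,
# which is the same mechanism the sample-data ingestion uses.
recipe = {
    "source": {"type": "file", "config": {"filename": "./my_entities.json"}},  # placeholder path
    "sink": {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}},
}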
    ✅ 1
    m
    • 2
    • 2