# ingestion
  • r

    red-pizza-28006

    08/16/2022, 7:15 AM
    Hello everyone - is anyone using AWS Secrets Manager to store their connections and then running DataHub on MWAA 2.2.2? We started running into this error. Any ideas?
    Copy code
    [2022-08-16, 01:10:09 UTC] {{taskinstance.py:1703}} ERROR - Task failed with exception
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
        self._execute_task_with_callbacks(context)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
        result = self._execute_task(context, self.task)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1509, in _execute_task
        result = execute_callable(context=context)
      File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 149, in execute
        self.op_kwargs = determine_kwargs(self.python_callable, self.op_args, context)
      File "/usr/local/lib/python3.7/site-packages/airflow/utils/operator_helpers.py", line 111, in determine_kwargs
        raise ValueError(f"The key {name} in args is part of kwargs and therefore reserved.")
    ValueError: The key conn in args is part of kwargs and therefore reserved.
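    The traceback points at Airflow's determine_kwargs, which reserves template-context keys when it builds kwargs for a python_callable; the log shows that on MWAA 2.2.2 the context includes a key named conn, so a callable parameter (or op_kwargs entry) with that name collides with it. A minimal hedged sketch of the rename workaround; the DAG and names below are illustrative, not the poster's code:
    Copy code
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_ingestion(datahub_conn_id, **context):  # was: def run_ingestion(conn, **context)
        print(f"would run DataHub ingestion using connection {datahub_conn_id}")

    with DAG("datahub_ingest_example", start_date=datetime(2022, 8, 1), schedule_interval=None) as dag:
        PythonOperator(
            task_id="ingest",
            python_callable=run_ingestion,
            # was: op_kwargs={"conn": ...}; "conn" now clashes with the reserved context key
            op_kwargs={"datahub_conn_id": "datahub_rest_default"},
        )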
  • b

    brave-tomato-16287

    08/16/2022, 9:33 AM
    Hello all! Is it possible to ingest Prep Flows from Tableau Server?
  • a

    alert-fall-82501

    08/16/2022, 1:51 PM
    'records_written': '13', 'warnings': [], 'failures': [{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status422] com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed'
  • a

    alert-fall-82501

    08/16/2022, 1:52 PM
    Can anybody advise on this error?
  • r

    refined-ability-35859

    08/16/2022, 3:30 PM
    Hello, I get the below error when I try to ingest views from a Vertica DB using the beta version connector. Can anyone advise on this error?
    Copy code
    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/sql/sql_common.py", line 1175, in loop_views
        yield from self._process_view(
      File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/sql/sql_common.py", line 1219, in _process_view
        view_definition = inspector.get_view_definition(view, schema)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/reflection.py", line 337, in get_view_definition
        return self.dialect.get_view_definition(
      File "<string>", line 2, in get_view_definition
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/reflection.py", line 52, in cache
        ret = fn(self, con, *args, **kw)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/dialects/postgresql/base.py", line 3022, in get_view_definition
        view_def = connection.scalar(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 941, in scalar
        return self.execute(object_, *multiparams, **params).scalar()
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1011, in execute
        return meth(self, multiparams, params)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
        return connection._execute_clauseelement(self, multiparams, params)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
        ret = self._execute_context(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
        self._handle_dbapi_exception(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
        util.raise_(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
        raise exception
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
        self.dialect.do_execute(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
      File "/usr/local/lib/python3.8/dist-packages/vertica_python/vertica/cursor.py", line 222, in execute
        self._execute_simple_query(operation)
      File "/usr/local/lib/python3.8/dist-packages/vertica_python/vertica/cursor.py", line 606, in _execute_simple_query
        raise errors.QueryError.from_error_response(self._message, query)
    sqlalchemy.exc.ProgrammingError: (vertica_python.errors.MissingRelation) Severity: ERROR, Message: Relation "pg_class" does not exist, Sqlstate: 42V01, Routine: throwRelationDoesNotExist, File: /data/jenkins/workspace/RE-ReleaseBuilds/RE-Knuckleboom/server/vertica/Catalog/CatalogLookup.cpp, Line: 4118, Error Code: 4566, SQL: "SELECT pg_get_viewdef(c.oid) view_def FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = 'dominodatalab_schema' AND c.relname = 'csv_telco_view' AND c.relkind IN ('v', 'm')"
    [SQL: SELECT pg_get_viewdef(c.oid) view_def FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = :schema AND c.relname = :view_name AND c.relkind IN ('v', 'm')]
    [parameters: {'schema': 'dominodatalab_schema', 'view_name': 'csv_telco_view'}]
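    The traceback shows the SQLAlchemy Postgres dialect's get_view_definition being used against Vertica: it queries pg_class, a Postgres catalog table that Vertica does not have. A hedged workaround sketch (not a fix for the dialect itself): skip view extraction so table ingestion can proceed; connection values are placeholders.
    Copy code
    source:
      type: vertica
      config:
        host_port: "vertica-host:5433"   # placeholder
        database: "mydb"
        username: "user"
        password: "pass"
        include_tables: true
        include_views: false   # view reflection is what raises MissingRelation: pg_class
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"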
  • k

    kind-whale-32412

    08/16/2022, 5:29 PM
    Hello there: regarding the Java emitter library (https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/as-a-library.md), it only supports the emit function for the HTTP emitter. This is very limiting compared to the Python library, where you can extend the emitter with
    DataHubGraph
    . For instance, in the Python library I can make GET requests (like get_aspect_v2), whereas with the Java HTTP emitter I can't. I know that I can write my own HTTP client to do this, but since this is already supported in DataHub's own source code in
    DefaultRestliClientFactory
    , it shouldn't really be that much work to allow these operations in the Java library (they could just be made available in the datahub-client Maven package). Please do let me know if I missed an easy way to get the REST client in Java (other than writing my own or copy/pasting classes around).
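    For reference, a hedged Python sketch of the kind of read the message describes, using DataHubGraph with get_aspect_v2; the server URL and dataset URN are placeholders:
    Copy code
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    # get_aspect_v2 issues a read against GMS, which the Java Rest emitter does not offer today.
    props = graph.get_aspect_v2(
        entity_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
        aspect="datasetProperties",
        aspect_type=DatasetPropertiesClass,
    )
    print(props)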
  • c

    creamy-tent-10151

    08/16/2022, 7:29 PM
    Hi all, my team was wondering whether support for SAS or SAS file ingestion would be possible, and whether there are any plans to implement it in the future. Thanks!
  • e

    eager-terabyte-73886

    08/16/2022, 8:55 PM
    Hi, I am completely new to DataHub and set it up locally. I also ingested some data (created a YAML file). How can I see what changes that ingestion makes in my local setup?
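    A hedged way to check, assuming a default quickstart install: browse the UI at http://localhost:9002 and search for the ingested entities, or pull a single entity back out with the CLI. The URN below is a placeholder:
    Copy code
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.mytable,PROD)"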
  • s

    straight-agent-79732

    08/17/2022, 2:49 AM
    Hi, is passing a file URL (an S3 bucket file URL) supported in the datahub-business-glossary recipe?
  • s

    straight-agent-79732

    08/17/2022, 7:37 AM
    I set up DataHub on our servers using quickstart. Even though I have granted the manage-token policy to the datahub user, the UI shows the following error: "Token based authentication is currently disabled. Contact your DataHub administrator to enable this feature." Can someone tell me where to configure this, apart from policies?
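    Token-based authentication is controlled by a server-side flag rather than by policies alone. A hedged sketch for the docker quickstart, assuming the standard service names; the flag needs to be set on both GMS and the frontend:
    Copy code
    # docker-compose override sketch (service names as in the quickstart compose file)
    services:
      datahub-gms:
        environment:
          - METADATA_SERVICE_AUTH_ENABLED=true
      datahub-frontend-react:
        environment:
          - METADATA_SERVICE_AUTH_ENABLED=true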
  • e

    eager-terabyte-73886

    08/17/2022, 8:33 AM
    I am trying to ingest using a sample YAML file, but it isn't working. I am using Postgres; can anyone help?
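    A minimal hedged Postgres recipe sketch (all values are placeholders); comparing it against the failing one may help narrow down what differs:
    Copy code
    source:
      type: postgres
      config:
        host_port: "localhost:5432"
        database: "mydb"
        username: "user"
        password: "pass"
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"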
  • s

    square-hair-99480

    08/17/2022, 8:38 AM
    Hello friends, a question. I am ingesting n Snowflake databases which are very similar (same schemas and tables) but with small differences in columns. Say I have dwh_es, dwh_de, dwh_fr, dwh_it ... they are separate businesses but run basically the same business model. My problem is that if I add documentation to tables in dwh_es, the documentation for the remaining dwh_'s will be very similar with very small adjustments, so I do not want to add it manually. Is there a simple way to deal with these cases without having to manually add and keep updating repeated documentation, while still allowing for the small differences? Something like:
    updates in the UI -> control file
    updates via API/GraphQL -> control file
    updates on the control file -> show in the UI
    control files for databases in the set {dwh_es, dwh_fr, dwh_it} are kept in sync
    That would mean that if I add documentation to a specific column in dwh_es, and this column has no documentation in the others or the documentation is identical, it gets synced.
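    Nothing built-in does this sync, but a hedged sketch of the "control script" idea with the Python client: read the editable description from the dwh_es dataset and write it to the siblings. The URNs, the sync policy, and column-level handling (which would use editableSchemaMetadata instead) are assumptions to adapt.
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        EditableDatasetPropertiesClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))

    # Placeholder URNs: the "source of truth" dataset and its siblings in the other DWHs.
    source_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_es.public.orders,PROD)"
    sibling_urns = [
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_de.public.orders,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_fr.public.orders,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_it.public.orders,PROD)",
    ]

    # Table-level descriptions edited in the UI live in editableDatasetProperties.
    props = graph.get_aspect_v2(
        entity_urn=source_urn,
        aspect="editableDatasetProperties",
        aspect_type=EditableDatasetPropertiesClass,
    )

    if props and props.description:
        for urn in sibling_urns:
            graph.emit_mcp(
                MetadataChangeProposalWrapper(
                    entityType="dataset",
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=urn,
                    aspectName="editableDatasetProperties",
                    aspect=EditableDatasetPropertiesClass(description=props.description),
                )
            )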
  • c

    clean-monkey-7245

    08/17/2022, 8:43 AM
    Team, while running Snowflake ingestion we are getting this exception:
  • c

    clean-monkey-7245

    08/17/2022, 8:43 AM
    Copy code
    'container-urn:li:container:d0d5d558096e9a8f47ffc324b65cddc0-to-urn:li:dataset:(urn:li:dataPlatform:snowflake,max_dev.identity.hbo_adst_full_analysis,PROD)\n'
               '/usr/local/bin/run_ingest.sh: line 33:   583 Killed                  ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
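    The "Killed" line from run_ingest.sh usually means the ingestion process was terminated by the operating system, most often for running out of memory. A hedged sketch of the usual levers: narrow the run and turn profiling off (connection fields below are placeholders; keep whatever the existing recipe uses), or give the actions/ingestion container more memory.
    Copy code
    source:
      type: snowflake
      config:
        account_id: "xy12345"        # placeholder - keep your existing connection settings
        warehouse: "COMPUTE_WH"
        username: "user"
        password: "pass"
        schema_pattern:
          allow:
            - "identity"             # narrow the run to the schemas actually needed
        profiling:
          enabled: false             # profiling is typically the most memory-hungry step
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"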
  • s

    sparse-forest-98608

    08/17/2022, 8:51 AM
    I want to extract metadata from a CSV file and push that metadata to DataHub.
  • s

    sparse-forest-98608

    08/17/2022, 8:52 AM
    The CSV may be with or without a header, with dynamic columns and datatypes.
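    There is no dedicated raw-CSV connector for this, but a hedged Python sketch of the idea: infer columns with pandas and emit a schemaMetadata aspect through the REST emitter. The file path, platform name, and the type mapping are assumptions; a header-less file would need header=None plus generated column names.
    Copy code
    import pandas as pd
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        NumberTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    # Sample a few rows; enough to let pandas infer column names and dtypes.
    df = pd.read_csv("example.csv", nrows=100)  # header-less files: pd.read_csv(..., header=None)

    def to_datahub_type(dtype) -> SchemaFieldDataTypeClass:
        # Very coarse mapping: numeric vs everything else.
        if pd.api.types.is_numeric_dtype(dtype):
            return SchemaFieldDataTypeClass(type=NumberTypeClass())
        return SchemaFieldDataTypeClass(type=StringTypeClass())

    fields = [
        SchemaFieldClass(
            fieldPath=str(col),
            type=to_datahub_type(df[col].dtype),
            nativeDataType=str(df[col].dtype),
        )
        for col in df.columns
    ]

    schema = SchemaMetadataClass(
        schemaName="example.csv",
        platform="urn:li:dataPlatform:file",
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=fields,
    )

    emitter = DatahubRestEmitter("http://datahub-gms:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="file", name="example.csv", env="PROD"),
            aspectName="schemaMetadata",
            aspect=schema,
        )
    )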
  • m

    microscopic-mechanic-13766

    08/17/2022, 9:10 AM
    Good morning everyone, quick question: let's suppose I have ingested data from PostgreSQL. If, after the ingestion, I create a new table in the database that was ingested into DataHub, would that table show up in DataHub automatically, or would I have to ingest from the database again? I ask because I have always done the second option (re-ingesting), which is not optimal. Moreover, since DataHub has a 3rd-generation architecture, it should be possible to ingest data in real time from the sources, right?
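    Most sources are pull-based, so a new table only appears after the next ingestion run; the usual approach is to schedule recurring runs (via the UI scheduler or cron) rather than rely on real-time capture. A hedged cron sketch, with a placeholder recipe path:
    Copy code
    # run the recipe hourly and append the output to a log file
    0 * * * * datahub ingest -c /recipes/postgres.yaml >> /var/log/datahub-ingest.log 2>&1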
  • f

    fresh-cricket-75926

    08/17/2022, 9:25 AM
    Hello community, good day. During recipe ingestion we are trying to retain the existing tags for each dataset and avoid overwriting any tags. Any solution or suggestion to implement this in the recipe would be helpful. In one of the threads I learned that this is unfortunately a known problem.
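    One hedged option to check: newer CLI versions let some transformers merge with what is already on the server via semantics: PATCH, instead of replacing the aspect; whether the installed version supports it for the tag transformer needs verifying. A sketch, with a placeholder tag URN:
    Copy code
    transformers:
      - type: simple_add_dataset_tags
        config:
          semantics: PATCH          # merge with existing tags instead of replacing the aspect
          tag_urns:
            - "urn:li:tag:NeedsReview"   # placeholder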
  • a

    alert-fall-82501

    08/17/2022, 9:57 AM
    Hi team - I have an issue with ingestion. I am running the config file through Apache Airflow DAG jobs; the S3 data lake is my source and DataHub runs on our own server. After running this, I only get the S3 path but no data table. Please advise.
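    With the s3 source, datasets are derived from path_specs, so if the pattern stops at a folder or lacks a {table} placeholder, only the path may end up being modelled. A hedged sketch of a path_spec that maps folders to tables; the bucket and layout are placeholders for whatever the lake actually looks like:
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: "s3://my-bucket/data/{table}/*.parquet"   # {table} names the dataset
        aws_config:
          aws_region: "us-east-1"   # credentials via the usual AWS mechanisms or access keys
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"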
  • b

    busy-glass-61431

    08/17/2022, 10:23 AM
    Hi Team I am trying to ingest data from AWS Glue. For a table I have in my glue catalog, I can see the following properties:
    NumberOfBuckets
    ,
    StoredAsSubDirectories
    ,
    SortColumns
    in DataHub, which show incorrect values, e.g. NumberOfBuckets: 0, StoredAsSubDirectories: false. Is anyone else seeing the same issue, or could someone point out what I may be missing or doing incorrectly? I have cross-verified the IAM permissions; I have granted:
    Copy code
    "Action": [
            "glue:GetDatabases",
            "glue:GetTables",
            "glue:GetDataflowGraph",
            "glue:GetJobs",
            "s3:GetObject"
        ]
  • j

    jolly-traffic-67085

    08/17/2022, 10:36 AM
    Hi team, I have an issue. When I ingest a new data source it succeeds, but I can't find the URN and metadata in the RDBMS (datahub-prerequisites-mysql), even though the data shows in the DataHub UI. I need to know why it isn't written to datahub-prerequisites-mysql. Thanks.
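    For what it's worth, entity metadata lands in a single metadata_aspect_v2 table rather than one table per source, which is easy to miss. A hedged SQL sketch for checking whether the rows exist; the URN filter is a placeholder:
    Copy code
    -- run against the datahub-prerequisites MySQL database
    SELECT urn, aspect, version, createdon
    FROM metadata_aspect_v2
    WHERE urn LIKE '%my_new_table%'
    ORDER BY createdon DESC
    LIMIT 20;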
  • m

    microscopic-mechanic-13766

    08/17/2022, 11:23 AM
    Hello again, I am trying to ingest metadata from a Kerberized Trino, but I keep getting this error:
    Copy code
    TypeError: __init__() got an unexpected keyword argument 'kerberos_service_name'
    my recipe is the following:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: '<http://datahub-gms:8080>'
    source:
        type: trino
        config:
            database: null
            host_port: 'trino-coordinator:8080'
            username: trino
            options:
                connect_args:
                    http_scheme: https
                    auth: KERBEROS
                    kerberos_service_name: trino-hive
    Note: There is no port collision, as the port of the GMS is not exposed externally.
  • c

    calm-dinner-63735

    08/17/2022, 11:28 AM
    I am trying to ingest some metadata from S3 but the job is failing; the error is in the message thread.
  • c

    creamy-church-10353

    08/17/2022, 12:25 PM
    Hello everyone, hope you are doing well. I'm facing an issue with DataHub ingestion via file-based lineage; it is logging
    'records_written': 0
    , whereas I have defined 2 entities in
    lineage.yaml
    . Adding more details below FYI. These entities are at
    root
    level in the lineage tree; they have no upstream. -
    lineage.yaml
    file (I have tried with a blank
    upstream: []
    as well but get the same result)
    Copy code
    lineage:
    - entity:
        env: dev
        name: self
        platform: airflow
        platform_instance: demo
        type: dataset
    - entity:
        env: dev
        name: etl_clseq_36
        platform: airflow
        platform_instance: demo
        type: dataset
    version: 1
    -
    airflow-recipe.yaml
    file
    Copy code
    sink:
      config:
        server: <http://datahub-gms:8080>
        token: '$DATAHUB_GMS_TOKEN'
      type: datahub-rest
    source:
      config:
        file: lineage.yaml
        preserve_upstream: true
      type: datahub-lineage-file
    - datahub ingestion command
    Copy code
    $ datahub ingest run -c airflow-recipe.yaml
    
    Source (datahub-lineage-file) report:
    {'workunits_produced': 0,
     'workunit_ids': [],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.33',
     'cli_entry_location': 'python3.8/site-packages/datahub/__init__.py',
     'py_version': '3.8.13 (default, May  8 2022, 17:52:27) \n[Clang 13.1.6 (clang-1316.0.21.2)]',
     'py_exec_path': 'python3.8',
     'os_details': '....'}
    
    Sink (datahub-rest) report:
    {'records_written': 0,
     'warnings': [],
     'failures': [],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None,
     'gms_version': 'v0.8.33'}
    ...
    I'm not getting any error log. Please point out what I am doing wrong here.
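    A hedged reading of the report: the lineage-file source only emits lineage edges, so entities declared with no upstream give it nothing to write, which matches records_written: 0. A sketch of the same file with an upstream declared (names copied from the example above; adjust as needed):
    Copy code
    lineage:
      - entity:
          name: etl_clseq_36
          type: dataset
          env: dev
          platform: airflow
          platform_instance: demo
        upstream:
          - entity:
              name: self
              type: dataset
              env: dev
              platform: airflow
              platform_instance: demo
    version: 1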
  • f

    few-holiday-55907

    08/17/2022, 1:41 PM
    Hi everyone! Does anyone have experience with dbt Cloud and DataHub hosted on GCP? We are evaluating whether DataHub could be a solution for us; however, integrating dbt Cloud seems like a bigger issue, since we cannot easily get the needed files.
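    A hedged option, not an official integration: dbt Cloud exposes run artifacts (manifest.json, catalog.json, run_results.json) over its API, so they can be pulled per run and fed to the dbt source. The account ID, run ID, and token are placeholders; verify the endpoint for your plan:
    Copy code
    # pull the manifest for a finished dbt Cloud run, then point the dbt source at the files
    curl -s -H "Authorization: Token $DBT_CLOUD_TOKEN" \
      "https://cloud.getdbt.com/api/v2/accounts/<account_id>/runs/<run_id>/artifacts/manifest.json" \
      -o manifest.json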
  • b

    bland-teacher-2077

    08/17/2022, 3:47 PM
    Hi! I am new to datahub and created a simple recipe to an s3 bucket below. The two questions I have are: • Is the transformer type
    simple_add_dataset_domain
    available for the s3 source? I am using CLI version: 0.8.43.1 and receiving the error:
    "[2022-08-17 15:44:14,225] ERROR    {datahub.entrypoints:188} - Command failed with 'Did not find a registered class for "
    "simple_add_dataset_domain'. Run with --debug to get full trace\n"
    • Is it possible to also upload metadata in addition to the bucket and object tags? Here's the successful recipe. The error occurs when I add the transformer type
    simple_add_dataset_domain
    :
    transformers:
      - type: simple_add_dataset_tags
        config:
          tag_urns:
            - 'urn:li:tag:dummytagone'
            - 'urn:li:tag:dummytagtwo'
      - type: simple_add_dataset_terms
        config:
          term_urns:
            - 'urn:li:glossaryTerm:sampleterm1'
            - 'urn:li:glossaryTerm:sampleterm2'
      - type: simple_add_dataset_ownership
        config:
          owner_urns:
            - 'urn:li:corpuser:datahub'
            - 'urn:li:corpGroup:Sample Group'
          ownership_type: PRODUCER
    sink:
      type: datahub-rest
      config:
        server: '<http://datahub-gms:8080>'
    source:
      type: s3
      config:
        profiling:
          enabled: false
        use_s3_object_tags: true
        use_s3_bucket_tags: true
        path_specs:
          - include: 's3://*****/*****/*.*'
        env: PROD
        aws_config:
          aws_access_key_id: *****
          aws_region: *****
          aws_secret_access_key: *****
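    "Did not find a registered class" usually means the transformer is not present in the installed CLI version, so simple_add_dataset_domain may simply require a newer release than 0.8.43.1. For reference, a hedged sketch of its documented shape; the domain URN is a placeholder:
    Copy code
    transformers:
      - type: simple_add_dataset_domain
        config:
          domains:
            - "urn:li:domain:marketing"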
  • a

    alert-fall-82501

    08/17/2022, 11:57 AM
    Can anybody advise on this error?
    errordata.txt
  • d

    dazzling-insurance-83303

    08/17/2022, 7:20 PM
    Bubbling this up for a confirmation. Thanks!
  • b

    brash-airport-6045

    08/17/2022, 8:04 PM
    Hello guys! How can I delete from DataHub the tables that I have deleted from my source database? I know DataHub supports this functionality, but I can't seem to find where it is. Thanks for the help.
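    Two hedged options, depending on whether this should happen automatically: stateful ingestion can soft-delete entities that disappear from the source between runs, and the CLI can remove a specific URN. A sketch with placeholder values:
    Copy code
    pipeline_name: my_postgres_pipeline   # required so state can be tracked across runs
    source:
      type: postgres
      config:
        host_port: "localhost:5432"
        database: "mydb"
        username: "user"
        password: "pass"
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true   # soft-deletes tables that vanished from the source
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"
    For a one-off removal there is also the CLI, e.g. datahub delete --urn "<dataset urn>" --soft.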
  • b

    breezy-controller-54597

    08/18/2022, 5:25 AM
    About UI-based ingestion: the current datahub-actions image does not ship plugins for each data source; it installs them at runtime. However, this approach does not work in an offline environment. Since the datahub-ingestion image has all plugins installed, why not use the datahub-ingestion container for UI-based ingestion, with datahub-actions triggering datahub-ingestion? If deployed on Kubernetes, it would also be nice to run datahub-ingestion as a Pod via kubernetes-client. Basically, using the same version of datahub-ingestion as datahub-gms would be fine.