# ingestion
  • r

    red-pizza-28006

    08/16/2022, 7:15 AM
    Hello everyone - is anyone using AWS Secrets Manager to store their connections and then running DataHub on MWAA 2.2.2? We started running into this error. Any ideas?
    Copy code
    [2022-08-16, 01:10:09 UTC] {{taskinstance.py:1703}} ERROR - Task failed with exception
    Traceback (most recent call last):
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
        self._execute_task_with_callbacks(context)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
        result = self._execute_task(context, self.task)
      File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1509, in _execute_task
        result = execute_callable(context=context)
      File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 149, in execute
        self.op_kwargs = determine_kwargs(self.python_callable, self.op_args, context)
      File "/usr/local/lib/python3.7/site-packages/airflow/utils/operator_helpers.py", line 111, in determine_kwargs
        raise ValueError(f"The key {name} in args is part of kwargs and therefore reserved.")
    ValueError: The key conn in args is part of kwargs and therefore reserved.
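    The traceback points at Airflow's determine_kwargs, which reserves template-context keys when it builds kwargs for a python_callable; the log shows that on MWAA 2.2.2 the context includes a key named conn, so a callable parameter (or op_kwargs entry) with that name collides with it. A minimal hedged sketch of the rename workaround; the DAG and names below are illustrative, not the poster's code:
    Copy code
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def run_ingestion(datahub_conn_id, **context):  # was: def run_ingestion(conn, **context)
        print(f"would run DataHub ingestion using connection {datahub_conn_id}")

    with DAG("datahub_ingest_example", start_date=datetime(2022, 8, 1), schedule_interval=None) as dag:
        PythonOperator(
            task_id="ingest",
            python_callable=run_ingestion,
            # was: op_kwargs={"conn": ...}; "conn" now clashes with the reserved context key
            op_kwargs={"datahub_conn_id": "datahub_rest_default"},
        )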
  • b

    brave-tomato-16287

    08/16/2022, 9:33 AM
    Hello all! Is it possible to ingest Prep Flows from Tableau Server?
  • a

    alert-fall-82501

    08/16/2022, 1:51 PM
    'records_written': '13', 'warnings': [], 'failures': [{'error': 'Unable to emit metadata to DataHub GMS', 'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status422] com.linkedin.metadata.entity.ValidationException: Failed to validate record with class com.linkedin.entity.Entity: ERROR :: /value/com.linkedin.metadata.snapshot.DatasetSnapshot/aspects/0/com.linkedin.dataset.DatasetProperties/name :: unrecognized field found but not allowed'
  • a

    alert-fall-82501

    08/16/2022, 1:52 PM
    Can anybody advise on this error?
  • r

    refined-ability-35859

    08/16/2022, 3:30 PM
    Hello, I get the below error when I try to ingest views from a Vertica DB using the beta version connector. Can anyone advise on this error?
    Copy code
    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/sql/sql_common.py", line 1175, in loop_views
        yield from self._process_view(
      File "/usr/local/lib/python3.8/dist-packages/datahub/ingestion/source/sql/sql_common.py", line 1219, in _process_view
        view_definition = inspector.get_view_definition(view, schema)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/reflection.py", line 337, in get_view_definition
        return self.dialect.get_view_definition(
      File "<string>", line 2, in get_view_definition
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/reflection.py", line 52, in cache
        ret = fn(self, con, *args, **kw)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/dialects/postgresql/base.py", line 3022, in get_view_definition
        view_def = connection.scalar(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 941, in scalar
        return self.execute(object_, *multiparams, **params).scalar()
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1011, in execute
        return meth(self, multiparams, params)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
        return connection._execute_clauseelement(self, multiparams, params)
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1124, in _execute_clauseelement
        ret = self._execute_context(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1316, in _execute_context
        self._handle_dbapi_exception(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1510, in _handle_dbapi_exception
        util.raise_(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/util/compat.py", line 182, in raise_
        raise exception
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
        self.dialect.do_execute(
      File "/usr/local/lib/python3.8/dist-packages/sqlalchemy/engine/default.py", line 608, in do_execute
        cursor.execute(statement, parameters)
      File "/usr/local/lib/python3.8/dist-packages/vertica_python/vertica/cursor.py", line 222, in execute
        self._execute_simple_query(operation)
      File "/usr/local/lib/python3.8/dist-packages/vertica_python/vertica/cursor.py", line 606, in _execute_simple_query
        raise errors.QueryError.from_error_response(self._message, query)
    sqlalchemy.exc.ProgrammingError: (vertica_python.errors.MissingRelation) Severity: ERROR, Message: Relation "pg_class" does not exist, Sqlstate: 42V01, Routine: throwRelationDoesNotExist, File: /data/jenkins/workspace/RE-ReleaseBuilds/RE-Knuckleboom/server/vertica/Catalog/CatalogLookup.cpp, Line: 4118, Error Code: 4566, SQL: "SELECT pg_get_viewdef(c.oid) view_def FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = 'dominodatalab_schema' AND c.relname = 'csv_telco_view' AND c.relkind IN ('v', 'm')"
    [SQL: SELECT pg_get_viewdef(c.oid) view_def FROM pg_class c JOIN pg_namespace n ON n.oid = c.relnamespace WHERE n.nspname = :schema AND c.relname = :view_name AND c.relkind IN ('v', 'm')]
    [parameters: {'schema': 'dominodatalab_schema', 'view_name': 'csv_telco_view'}]
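    The traceback shows the SQLAlchemy Postgres dialect's get_view_definition being used against Vertica: it queries pg_class, a Postgres catalog table that Vertica does not have. A hedged workaround sketch (not a fix for the dialect itself): skip view extraction so table ingestion can proceed; connection values are placeholders.
    Copy code
    source:
      type: vertica
      config:
        host_port: "vertica-host:5433"   # placeholder
        database: "mydb"
        username: "user"
        password: "pass"
        include_tables: true
        include_views: false   # view reflection is what raises MissingRelation: pg_class
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"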
  • k

    kind-whale-32412

    08/16/2022, 5:29 PM
    Hello there: regarding the Java emitter library (https://github.com/datahub-project/datahub/blob/master/metadata-integration/java/as-a-library.md), it only supports the emit function for the HTTP emitter. This is very limiting compared to the Python library, where you can extend the emitter with
    DataHubGraph
    . For instance, in the Python library I can make GET requests (like get_aspect_v2), whereas with the Java HTTP emitter I can't. I know that I can write my own HTTP client to do this, but since this is already supported in DataHub's own source code in
    DefaultRestliClientFactory
    , it shouldn't really be that much work to allow these operations in the Java library (they could just be made available in the datahub-client Maven package). Please do let me know if I missed an easy way to get the REST client in Java (other than writing my own or copy/pasting classes around).
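    For reference, a hedged Python sketch of the kind of read the message describes, using DataHubGraph with get_aspect_v2; the server URL and dataset URN are placeholders:
    Copy code
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    # get_aspect_v2 issues a read against GMS, which the Java Rest emitter does not offer today.
    props = graph.get_aspect_v2(
        entity_urn="urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)",
        aspect="datasetProperties",
        aspect_type=DatasetPropertiesClass,
    )
    print(props)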
  • c

    creamy-tent-10151

    08/16/2022, 7:29 PM
    Hi all, my team was wondering whether support for SAS or SAS file ingestion would be possible, and whether there are any plans to implement it in the future. Thanks!
  • e

    eager-terabyte-73886

    08/16/2022, 8:55 PM
    Hi, I am completely new to DataHub and set it up locally. I also ingested some data (created a YAML file). How can I see what changes that ingestion makes in my local setup?
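    A hedged way to check, assuming a default quickstart install: browse the UI at http://localhost:9002 and search for the ingested entities, or pull a single entity back out with the CLI. The URN below is a placeholder:
    Copy code
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.mytable,PROD)"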
  • s

    straight-agent-79732

    08/17/2022, 2:49 AM
    Hi, is passing a file URL (an S3 bucket file URL) supported in the datahub-business-glossary recipe?
  • s

    straight-agent-79732

    08/17/2022, 7:37 AM
    I set up DataHub on our servers using quickstart. Even though I have granted the manage-token policy to the datahub user, the UI shows the following error: "Token based authentication is currently disabled. Contact your DataHub administrator to enable this feature." Can someone tell me where to configure this, apart from policies?
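    Token-based authentication is controlled by a server-side flag rather than by policies alone. A hedged sketch for the docker quickstart, assuming the standard service names; the flag needs to be set on both GMS and the frontend:
    Copy code
    # docker-compose override sketch (service names as in the quickstart compose file)
    services:
      datahub-gms:
        environment:
          - METADATA_SERVICE_AUTH_ENABLED=true
      datahub-frontend-react:
        environment:
          - METADATA_SERVICE_AUTH_ENABLED=true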
  • e

    eager-terabyte-73886

    08/17/2022, 8:33 AM
    I am trying to ingest using a sample YAML file, but it isn't working. I am using Postgres; can anyone help?
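    A minimal hedged Postgres recipe sketch (all values are placeholders); comparing it against the failing one may help narrow down what differs:
    Copy code
    source:
      type: postgres
      config:
        host_port: "localhost:5432"
        database: "mydb"
        username: "user"
        password: "pass"
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"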
  • s

    square-hair-99480

    08/17/2022, 8:38 AM
    Hello friends, a question. I am ingesting n Snowflake databases which are very similar (same schemas and tables) but with small differences in columns. Say I have dwh_es, dwh_de, dwh_fr, dwh_it ... they are separate businesses but run basically the same business model. My problem is that if I add documentation to tables in dwh_es, the documentation for the remaining dwh_'s will be very similar with very small adjustments, so I do not want to add it manually. Is there a simple way to deal with these cases without having to manually add and keep updating repeated documentation, while still allowing for the small differences? Something like:
    updates in the UI -> control file
    updates via API/GraphQL -> control file
    updates on the control file -> show in the UI
    control files for databases in the set {dwh_es, dwh_fr, dwh_it} are kept in sync
    That would mean that if I add documentation to a specific column in dwh_es, and this column has no documentation in the others or the documentation is identical, it gets synced.
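    Nothing built-in does this sync, but a hedged sketch of the "control script" idea with the Python client: read the editable description from the dwh_es dataset and write it to the siblings. The URNs, the sync policy, and column-level handling (which would use editableSchemaMetadata instead) are assumptions to adapt.
    Copy code
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        EditableDatasetPropertiesClass,
    )

    graph = DataHubGraph(DatahubClientConfig(server="http://datahub-gms:8080"))

    # Placeholder URNs: the "source of truth" dataset and its siblings in the other DWHs.
    source_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_es.public.orders,PROD)"
    sibling_urns = [
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_de.public.orders,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_fr.public.orders,PROD)",
        "urn:li:dataset:(urn:li:dataPlatform:snowflake,dwh_it.public.orders,PROD)",
    ]

    # Table-level descriptions edited in the UI live in editableDatasetProperties.
    props = graph.get_aspect_v2(
        entity_urn=source_urn,
        aspect="editableDatasetProperties",
        aspect_type=EditableDatasetPropertiesClass,
    )

    if props and props.description:
        for urn in sibling_urns:
            graph.emit_mcp(
                MetadataChangeProposalWrapper(
                    entityType="dataset",
                    changeType=ChangeTypeClass.UPSERT,
                    entityUrn=urn,
                    aspectName="editableDatasetProperties",
                    aspect=EditableDatasetPropertiesClass(description=props.description),
                )
            )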
  • c

    clean-monkey-7245

    08/17/2022, 8:43 AM
    Team, while running Snowflake ingestion we are getting this exception:
  • c

    clean-monkey-7245

    08/17/2022, 8:43 AM
    Copy code
    'container-urn:li:container:d0d5d558096e9a8f47ffc324b65cddc0-to-urn:li:dataset:(urn:li:dataPlatform:snowflake,max_dev.identity.hbo_adst_full_analysis,PROD)\n'
               '/usr/local/bin/run_ingest.sh: line 33:   583 Killed                  ( datahub ingest run -c "${recipe_file}" ${report_option} )\n',
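    The "Killed" line from run_ingest.sh usually means the ingestion process was terminated by the operating system, most often for running out of memory. A hedged sketch of the usual levers: narrow the run and turn profiling off (connection fields below are placeholders; keep whatever the existing recipe uses), or give the actions/ingestion container more memory.
    Copy code
    source:
      type: snowflake
      config:
        account_id: "xy12345"        # placeholder - keep your existing connection settings
        warehouse: "COMPUTE_WH"
        username: "user"
        password: "pass"
        schema_pattern:
          allow:
            - "identity"             # narrow the run to the schemas actually needed
        profiling:
          enabled: false             # profiling is typically the most memory-hungry step
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"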
  • s

    sparse-forest-98608

    08/17/2022, 8:51 AM
    I want to extract metadata from a CSV file and push that metadata to DataHub.
  • s

    sparse-forest-98608

    08/17/2022, 8:52 AM
    The CSV may be with or without a header, with dynamic columns and datatypes.
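    There is no dedicated raw-CSV connector for this, but a hedged Python sketch of the idea: infer columns with pandas and emit a schemaMetadata aspect through the REST emitter. The file path, platform name, and the type mapping are assumptions; a header-less file would need header=None plus generated column names.
    Copy code
    import pandas as pd
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        NumberTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    # Sample a few rows; enough to let pandas infer column names and dtypes.
    df = pd.read_csv("example.csv", nrows=100)  # header-less files: pd.read_csv(..., header=None)

    def to_datahub_type(dtype) -> SchemaFieldDataTypeClass:
        # Very coarse mapping: numeric vs everything else.
        if pd.api.types.is_numeric_dtype(dtype):
            return SchemaFieldDataTypeClass(type=NumberTypeClass())
        return SchemaFieldDataTypeClass(type=StringTypeClass())

    fields = [
        SchemaFieldClass(
            fieldPath=str(col),
            type=to_datahub_type(df[col].dtype),
            nativeDataType=str(df[col].dtype),
        )
        for col in df.columns
    ]

    schema = SchemaMetadataClass(
        schemaName="example.csv",
        platform="urn:li:dataPlatform:file",
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema=""),
        fields=fields,
    )

    emitter = DatahubRestEmitter("http://datahub-gms:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn=make_dataset_urn(platform="file", name="example.csv", env="PROD"),
            aspectName="schemaMetadata",
            aspect=schema,
        )
    )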
  • m

    microscopic-mechanic-13766

    08/17/2022, 9:10 AM
    Good morning everyone, quick question: let's suppose I have ingested data from PostgreSQL. If, after the ingestion, I create a new table in the database that was ingested into DataHub, would that table show up in DataHub automatically, or would I have to ingest from the database again? I ask because I have always done the second option (re-ingesting), which is not optimal. Moreover, since DataHub has a 3rd-generation architecture, it should be possible to ingest data in real time from the sources, right?
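    Most sources are pull-based, so a new table only appears after the next ingestion run; the usual approach is to schedule recurring runs (via the UI scheduler or cron) rather than rely on real-time capture. A hedged cron sketch, with a placeholder recipe path:
    Copy code
    # run the recipe hourly and append the output to a log file
    0 * * * * datahub ingest -c /recipes/postgres.yaml >> /var/log/datahub-ingest.log 2>&1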
  • f

    fresh-cricket-75926

    08/17/2022, 9:25 AM
    Hello community, good day. During recipe ingestion we are trying to retain the existing tags for each dataset and avoid overwriting any tags. Any solution or suggestion to implement this in the recipe would be helpful. In one of the threads I learned that this is unfortunately a known problem.
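    One hedged option to check: newer CLI versions let some transformers merge with what is already on the server via semantics: PATCH, instead of replacing the aspect; whether the installed version supports it for the tag transformer needs verifying. A sketch, with a placeholder tag URN:
    Copy code
    transformers:
      - type: simple_add_dataset_tags
        config:
          semantics: PATCH          # merge with existing tags instead of replacing the aspect
          tag_urns:
            - "urn:li:tag:NeedsReview"   # placeholder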
  • a

    alert-fall-82501

    08/17/2022, 9:57 AM
    Hi team - I have an issue with ingestion. I am running the config file through Apache Airflow DAG jobs; the S3 data lake is my source and DataHub runs on our own server. After running this, I only get the S3 path but no data table. Please advise.
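    With the s3 source, datasets are derived from path_specs, so if the pattern stops at a folder or lacks a {table} placeholder, only the path may end up being modelled. A hedged sketch of a path_spec that maps folders to tables; the bucket and layout are placeholders for whatever the lake actually looks like:
    Copy code
    source:
      type: s3
      config:
        path_specs:
          - include: "s3://my-bucket/data/{table}/*.parquet"   # {table} names the dataset
        aws_config:
          aws_region: "us-east-1"   # credentials via the usual AWS mechanisms or access keys
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"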
  • b

    busy-glass-61431

    08/17/2022, 10:23 AM
    Hi Team I am trying to ingest data from AWS Glue. For a table I have in my glue catalog, I can see the following properties:
    NumberOfBuckets
    ,
    StoredAsSubDirectories
    ,
    SortColumns
    in DataHub, which show incorrect values, e.g. NumberOfBuckets: 0, StoredAsSubDirectories: false. Is anyone else seeing the same issue, or could someone point out what I may be missing or doing incorrectly? I have cross-verified the IAM permissions; I have granted:
    Copy code
    "Action": [
            "glue:GetDatabases",
            "glue:GetTables",
            "glue:GetDataflowGraph",
            "glue:GetJobs",
            "s3:GetObject"
        ]
  • j

    jolly-traffic-67085

    08/17/2022, 10:36 AM
    Hi team, I have an issue. When I ingest a new data source it succeeds, but I can't find the URN and metadata in the RDBMS (datahub-prerequisites-mysql), even though the data shows in the DataHub UI. I need to know why it isn't written to datahub-prerequisites-mysql. Thanks.
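    For what it's worth, entity metadata lands in a single metadata_aspect_v2 table rather than one table per source, which is easy to miss. A hedged SQL sketch for checking whether the rows exist; the URN filter is a placeholder:
    Copy code
    -- run against the datahub-prerequisites MySQL database
    SELECT urn, aspect, version, createdon
    FROM metadata_aspect_v2
    WHERE urn LIKE '%my_new_table%'
    ORDER BY createdon DESC
    LIMIT 20;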
  • m

    microscopic-mechanic-13766

    08/17/2022, 11:23 AM
    Hello again, I am trying to ingest metadata from a Kerberized Trino, but I keep getting this error:
    Copy code
    TypeError: __init__() got an unexpected keyword argument 'kerberos_service_name'
    my recipe is the following:
    Copy code
    sink:
        type: datahub-rest
        config:
            server: '<http://datahub-gms:8080>'
    source:
        type: trino
        config:
            database: null
            host_port: 'trino-coordinator:8080'
            username: trino
            options:
                connect_args:
                    http_scheme: https
                    auth: KERBEROS
                    kerberos_service_name: trino-hive
    Note: There is no port collision, as the port of the GMS is not exposed externally.
  • c

    calm-dinner-63735

    08/17/2022, 11:28 AM
    I am trying to ingest some metadata from S3 but the job is failing; the error is in the message thread.
  • c

    creamy-church-10353

    08/17/2022, 12:25 PM
    Hello everyone, hope you are doing well. I'm facing an issue with DataHub ingestion via file-based lineage; it is logging
    'records_written': 0
    , whereas I have defined 2 entities in
    lineage.yaml
    . Adding more details below FYI. These entities are at
    root
    level in the lineage tree; they have no upstream. -
    lineage.yaml
    file (I have tried with a blank
    upstream: []
    as well but get the same result)
    Copy code
    lineage:
    - entity:
        env: dev
        name: self
        platform: airflow
        platform_instance: demo
        type: dataset
    - entity:
        env: dev
        name: etl_clseq_36
        platform: airflow
        platform_instance: demo
        type: dataset
    version: 1
    -
    airflow-recipe.yaml
    file
    Copy code
    sink:
      config:
        server: <http://datahub-gms:8080>
        token: '$DATAHUB_GMS_TOKEN'
      type: datahub-rest
    source:
      config:
        file: lineage.yaml
        preserve_upstream: true
      type: datahub-lineage-file
    - datahub ingestion command
    Copy code
    $ datahub ingest run -c airflow-recipe.yaml
    
    Source (datahub-lineage-file) report:
    {'workunits_produced': 0,
     'workunit_ids': [],
     'warnings': {},
     'failures': {},
     'cli_version': '0.8.33',
     'cli_entry_location': 'python3.8/site-packages/datahub/__init__.py',
     'py_version': '3.8.13 (default, May  8 2022, 17:52:27) \n[Clang 13.1.6 (clang-1316.0.21.2)]',
     'py_exec_path': 'python3.8',
     'os_details': '....'}
    
    Sink (datahub-rest) report:
    {'records_written': 0,
     'warnings': [],
     'failures': [],
     'downstream_start_time': None,
     'downstream_end_time': None,
     'downstream_total_latency_in_seconds': None,
     'gms_version': 'v0.8.33'}
    ...
    I'm not getting any error log. Please point out what I am doing wrong here.
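    A hedged reading of the report: the lineage-file source only emits lineage edges, so entities declared with no upstream give it nothing to write, which matches records_written: 0. A sketch of the same file with an upstream declared (names copied from the example above; adjust as needed):
    Copy code
    lineage:
      - entity:
          name: etl_clseq_36
          type: dataset
          env: dev
          platform: airflow
          platform_instance: demo
        upstream:
          - entity:
              name: self
              type: dataset
              env: dev
              platform: airflow
              platform_instance: demo
    version: 1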
  • f

    few-holiday-55907

    08/17/2022, 1:41 PM
    Hi everyone! Does anyone have experience with dbt Cloud and DataHub hosted on GCP? We are evaluating whether DataHub could be a solution for us; however, integrating dbt Cloud seems like a bigger issue, since we cannot easily get the needed files.
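    A hedged option, not an official integration: dbt Cloud exposes run artifacts (manifest.json, catalog.json, run_results.json) over its API, so they can be pulled per run and fed to the dbt source. The account ID, run ID, and token are placeholders; verify the endpoint for your plan:
    Copy code
    # pull the manifest for a finished dbt Cloud run, then point the dbt source at the files
    curl -s -H "Authorization: Token $DBT_CLOUD_TOKEN" \
      "https://cloud.getdbt.com/api/v2/accounts/<account_id>/runs/<run_id>/artifacts/manifest.json" \
      -o manifest.json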
  • b

    bland-teacher-2077

    08/17/2022, 3:47 PM
    Hi! I am new to datahub and created a simple recipe to an s3 bucket below. The two questions I have are: • Is the transformer type
    simple_add_dataset_domain
    available for the s3 source? I am using CLI version: 0.8.43.1 and receiving the error:
    "[2022-08-17 15:44:14,225] ERROR    {datahub.entrypoints:188} - Command failed with 'Did not find a registered class for "
    "simple_add_dataset_domain'. Run with --debug to get full trace\n"
    • Is it possible to also upload metadata in addition to the bucket and object tags? Here's the successful recipe. The error occurs when I add the transformer type
    simple_add_dataset_domain
    :
    transformers:
      - type: simple_add_dataset_tags
        config:
          tag_urns:
            - 'urn:li:tag:dummytagone'
            - 'urn:li:tag:dummytagtwo'
      - type: simple_add_dataset_terms
        config:
          term_urns:
            - 'urn:li:glossaryTerm:sampleterm1'
            - 'urn:li:glossaryTerm:sampleterm2'
      - type: simple_add_dataset_ownership
        config:
          owner_urns:
            - 'urn:li:corpuser:datahub'
            - 'urn:li:corpGroup:Sample Group'
          ownership_type: PRODUCER
    sink:
      type: datahub-rest
      config:
        server: '<http://datahub-gms:8080>'
    source:
      type: s3
      config:
        profiling:
          enabled: false
        use_s3_object_tags: true
        use_s3_bucket_tags: true
        path_specs:
          - include: 's3://*****/*****/*.*'
        env: PROD
        aws_config:
          aws_access_key_id: *****
          aws_region: *****
          aws_secret_access_key: *****
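    "Did not find a registered class" usually means the transformer is not present in the installed CLI version, so simple_add_dataset_domain may simply require a newer release than 0.8.43.1. For reference, a hedged sketch of its documented shape; the domain URN is a placeholder:
    Copy code
    transformers:
      - type: simple_add_dataset_domain
        config:
          domains:
            - "urn:li:domain:marketing"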
  • a

    alert-fall-82501

    08/17/2022, 11:57 AM
    Can anybody advise on this error?
    errordata.txt
  • d

    dazzling-insurance-83303

    08/17/2022, 7:20 PM
    Bubbling this up for a confirmation. Thanks!
  • b

    brash-airport-6045

    08/17/2022, 8:04 PM
    Hello guys! How can I delete from DataHub the tables that I have deleted from my source database? I know DataHub supports this functionality, but I can't seem to find where it is. Thanks for the help.
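    Two hedged options, depending on whether this should happen automatically: stateful ingestion can soft-delete entities that disappear from the source between runs, and the CLI can remove a specific URN. A sketch with placeholder values:
    Copy code
    pipeline_name: my_postgres_pipeline   # required so state can be tracked across runs
    source:
      type: postgres
      config:
        host_port: "localhost:5432"
        database: "mydb"
        username: "user"
        password: "pass"
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true   # soft-deletes tables that vanished from the source
    sink:
      type: datahub-rest
      config:
        server: "http://datahub-gms:8080"
    For a one-off removal there is also the CLI, e.g. datahub delete --urn "<dataset urn>" --soft.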
  • b

    breezy-controller-54597

    08/18/2022, 5:25 AM
    About UI-based ingestion: the current datahub-actions image does not ship plugins for each data source; it installs them at runtime. However, this approach does not work in an offline environment. Since the datahub-ingestion image has all plugins installed, why not use the datahub-ingestion container for UI-based ingestion, with datahub-actions triggering datahub-ingestion? If deployed on Kubernetes, it would also be nice to run datahub-ingestion as a Pod via kubernetes-client. Basically, using the same version of datahub-ingestion as datahub-gms would be fine.