# ingestion
  • f

    fresh-battery-23937

    05/03/2022, 10:16 PM
    hi, I have the Docker quickstart up and running OK, but I am unable to ingest from a Snowflake datasource. All images show healthy. Which container is responsible for ingestion, so I can check its logs?
  • b

    best-umbrella-24804

    05/03/2022, 11:56 PM
    Hi, I've got a Glue ingestion job that looks like this, and it's currently throwing this strange error.
  • b

    best-umbrella-24804

    05/04/2022, 12:40 AM
    Anybody ever implement the spark-agent for a Glue job? https://datahubproject.io/docs/metadata-integration/java/spark-lineage/#configuring-spark-agent Some sample code would be amazing.
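    For reference, a hedged sketch of wiring the spark-lineage listener into a Glue (PySpark) job; the package version, listener class and GMS URL are assumptions to verify against the spark-lineage docs, and in Glue these settings are usually supplied as --conf job parameters rather than set in code:
    Copy code
    # Hedged sketch: enable the DataHub spark-lineage listener for a Glue (PySpark) job.
    # Verify the package version, listener class and server URL against the docs; <gms-host> is a placeholder.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .set("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.32")
        .set("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .set("spark.datahub.rest.server", "http://<gms-host>:8080")
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # ... the rest of the Glue job runs as usual; the listener reports lineage for each Spark action.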
  • b

    billowy-refrigerator-34936

    05/04/2022, 1:20 AM
    Hello everybody, I'm new to DataHub. I saw the term pull-based ingestion; could anyone explain how it works? I also have a case: an operational system running separately from the DataHub system, where the only way to reach its data is through Kafka (every action in the operational system is sent to a Kafka topic). In this case, how do we extract metadata from the Kafka messages and capture schema changes? Each message in the Kafka topic is a full document of a record in the operational system's database.
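    For reference, "pull-based" ingestion means DataHub connects to the source on a schedule and pulls metadata via a recipe, rather than the source pushing events. A minimal sketch of a Kafka recipe run programmatically, assuming the topic schemas live in a Confluent-compatible schema registry so schema changes are picked up on each run (hostnames and server are placeholders):
    Copy code
    # Hedged sketch: pull-based ingestion of Kafka topic metadata.
    # Assumes a Confluent-compatible schema registry holds the topic schemas; hostnames are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()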
  • b

    best-umbrella-24804

    05/04/2022, 2:04 AM
    Hi, the following Glue recipe works when extract_transform is set to false. Removing the extract_transform config or setting it to true gives the following error. Any advice?
  • m

    mammoth-fountain-32989

    05/04/2022, 7:57 AM
    Hi, this has been asked by someone earlier, but I couldn't find a response, so I'm reposting. I see that partitions of partitioned Postgres tables are being ingested as separate objects. Is there any way or setting to avoid this (like ignore_partitions, etc.)? Thanks
  • d

    dazzling-queen-76396

    05/04/2022, 8:10 AM
    Hey! When I try to exclude some datasets in bigquery-usage ingestion I get this error
    Copy code
    '1 validation error for BigQueryUsageConfig\n'
               'dataset_pattern\n'
               '  extra fields not permitted (type=value_error.extra)\n',
    Here is my recipe
    Copy code
    source:
        type: bigquery-usage
        config:
            projects:
                - project1
                - project2
            credential:
                project_id: ''
                private_key_id: ''
                private_key: ''
                client_email: ''
                client_id: ''
            top_n_queries: 10
            include_operational_stats: true
            max_query_duration: 45
            dataset_pattern:
                deny:
                    - '.*_raw'
                    - '.*_dev'
                    - '.*_staging.*'
    sink:
        type: datahub-rest
        config:
            server: ''
    DataHub version is 0.8.32. Could you help me figure out the problem?
  • c

    cold-hydrogen-10513

    05/04/2022, 11:49 AM
    hi, I can't find the required permissions for the Snowflake user on the docs page anymore: https://datahubproject.io/docs/generated/ingestion/sources/snowflake. Previously there was a list of permissions that should be granted to the user DataHub uses to ingest metadata, lineage, usage, etc. Could you please add it back, or send me a link to another page that contains that info? I'm asking because I started getting this error in my DAG (I'm using an Airflow 1.12 DAG to import metadata; DataHub version is 0.8.34.1):
    Copy code
    [2022-05-04 11:34:53,096] {{sql_common.py:496}} WARNING - lineage => Extracting lineage from Snowflake failed.Please check your premissions. Continuing...
  • a

    agreeable-army-26750

    05/04/2022, 1:15 PM
    Hi everyone! I would like to develop a feature that, based on specific user inputs (via a form), generates a YAML configuration file for ingesting glossary terms into DataHub. Maybe I am wrong, but from the provided examples, the ingestible file is separate from the actual glossary configuration: there is one file that references the real configuration (picture 1), and the real glossary configuration itself (picture 2). The business_glossary.yml cannot be used as is (picture 2), so we need a wrapper config (picture 1) that points to the file in order to run it. I would like to merge these two config files into one (e.g. adding a new attribute under file, like config, that actually contains the configuration shown in picture 2). Is there an easy way to enable that? I tried to mess with the Python configuration so that it goes through validation with the new field, but lots of files and validations failed afterwards. Can you point me in the right direction? Is it even possible with minimal coding effort? Thank you in advance for helping 🙂
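    For reference, the wrapper recipe ("picture 1") is an ordinary ingestion recipe whose source just points at the glossary file, so one alternative to merging the two files is to have the form write business_glossary.yml and drive the wrapper programmatically. A minimal sketch, with the file path and server as placeholders:
    Copy code
    # Hedged sketch: run the business-glossary wrapper recipe from Python so a generated
    # business_glossary.yml can be ingested without a hand-written wrapper file.
    # File path and GMS server are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {"file": "./business_glossary.yml"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()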
  • l

    lemon-terabyte-66903

    05/04/2022, 1:41 PM
    How do I hide/mask sample values of sensitive fields in the Profiling tab?
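    One hedged option, assuming the GE-based profiler exposes the include_field_sample_values flag in your version: it turns off sample values for the whole run rather than per field. Connection details below are placeholders:
    Copy code
    # Hedged sketch: disable sample values in profiling output entirely.
    # Assumes the profiling config supports include_field_sample_values; per-field masking
    # would need a different mechanism. Connection details are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",  # any SQL source with profiling support
                "config": {
                    "host_port": "db-host:5432",
                    "database": "db_name",
                    "username": "user",
                    "password": "pass",
                    "profiling": {
                        "enabled": True,
                        "include_field_sample_values": False,
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()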
  • m

    mammoth-fall-12031

    05/04/2022, 2:24 PM
    I have ingested an MSSQL database into our DataHub setup; however, I don't see lineage data being picked up, although it's mentioned that it will be picked up by default. Below is the sample YAML I'm using for ingestion.
    Copy code
    source:
      type: mssql
      config:
        username: username
        password: password
        database: db_name
        host_port: host_with_port
        profiling:
            enabled: true
    
    # see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    Anything I need to add to explicitly enable lineage?
  • n

    nutritious-bird-77396

    05/04/2022, 8:11 PM
    Team... Is it possible to trigger ingestion via the API? I found the GraphQL endpoint createIngestionExecutionRequest, but it's not clear how it would fetch the ingestion recipe. Any thoughts?
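    For reference, a hedged sketch of how this appears intended to work: the recipe is stored on the ingestion source entity itself, and createIngestionExecutionRequest only references that source by URN, so the mutation does not carry the recipe. Server URL, token and URN below are placeholders:
    Copy code
    # Hedged sketch: trigger a UI-managed ingestion source via the GraphQL API.
    # Assumes the mutation input takes the ingestion source URN and the stored recipe is used;
    # server URL, access token and URN are placeholders.
    import requests

    MUTATION = """
    mutation runIngestion($urn: String!) {
      createIngestionExecutionRequest(input: { ingestionSourceUrn: $urn })
    }
    """

    response = requests.post(
        "http://localhost:8080/api/graphql",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json={
            "query": MUTATION,
            "variables": {"urn": "urn:li:dataHubIngestionSource:<source-id>"},
        },
    )
    response.raise_for_status()
    print(response.json())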
  • m

    millions-waiter-49836

    05/04/2022, 9:19 PM
    Hi team, about LookML ingestion: can I ask why this is reported as a failure rather than a warning: https://github.com/datahub-project/datahub/blob/075d19ef166177ececfbb39796de4721bd[…]e9dc1/metadata-ingestion/src/datahub/ingestion/source/lookml.py Sometimes when a view file is included in a LookML model, it may not physically exist in the base_folder, as the LookML repo can be quite messy in production… I wonder if we can downgrade the failure to a warning so the LookML ingestion job won't show a 'fail' status every time.
  • o

    orange-coat-2879

    05/04/2022, 11:25 PM
    Hi team, for this error, does it mean I just need to increase device space? I am doing everything on a VM.
  • r

    rich-policeman-92383

    05/05/2022, 12:49 PM
    While specifying the Hive queue as mentioned here: https://datahubspace.slack.com/archives/CUMUWQU66/p1640608353477600 I am getting the below error: TypeError: __init__() got an unexpected keyword argument 'tez.queue.name'
    Copy code
    File "/IngestionRecipies/dhubv08_19/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 508, in connect
        return self.dbapi.connect(*cargs, **cparams)
    File "/IngestionRecipies/dhubv08_19/lib64/python3.6/site-packages/pyhive/hive.py", line 126, in connect
        return Connection(*args, **kwargs)
    
    TypeError: __init__() got an unexpected keyword argument 'tez.queue.name'
    [2022-05-05 18:14:40,880] INFO     {datahub.entrypoints:162} - DataHub CLI version: 0.8.31 at /IngestionRecipies/dhubv08_19/lib64/python3.6/site-packages/datahub/__init__.py
    [2022-05-05 18:14:40,880] INFO     {datahub.entrypoints:165} - Python version: 3.6.8 (default, Aug 13 2020, 07:46:32) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] at /IngestionRecipies/dhubv08_19/bin/python3 on Linux-3.10.0-1160.25.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
    [2022-05-05 18:14:40,880] INFO     {datahub.entrypoints:167} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.31', 'commit': '2f078c981c86b72145eebf621230ffd445948ef6'}}, 'managedIngestion': {'defaultCliVersion': '0.8.26.6', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'retention': 'true', 'noCode': 'true'}
    YML
    Copy code
    ---
    source:
      type: hive
      config:
        host_port: hive:10000
        env: "PROD"
        database: databaseName
        table_pattern:
          allow:
            - datasetname$
        options:
          connect_args: {'auth': 'KERBEROS','kerberos_service_name': 'hive', 'tez.queue.name': 'root.myqueue'}
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - datasetname
    sink:
      type: "datahub-rest"
      config:
        server: "<https://didatahub.airtel.com:8080>"
  • s

    straight-telephone-84434

    05/05/2022, 2:51 PM
    I am trying to ingest some BigQuery data, but something is not working in the libraries. This is my YAML:
    Copy code
    source:
      type: bigquery
      config:
        project_id: "internal-project-bigquery"
        options:
          credentials_path: "./key_file.json"
        table_pattern:
          allow:
          # Allow only one table
          - "bigquery-public-data.chicago_crime.crime"
    sink:
      # sink data
      type: "datahub-rest"
      config:
        server: # "sink_server"
    This is the error I am getting:
    Copy code
    AttributeError: module 'pybigquery.sqlalchemy_bigquery' has no attribute 'dialect'
    acryl-datahub, version 0.8.28.1
  • w

    worried-motherboard-80036

    05/05/2022, 5:00 PM
    Is there a way to pass an SSL context, for example? Something like this would work:
    Copy code
    import ssl

    from elasticsearch import Elasticsearch
    # create_ssl_context is commonly imported from elasticsearch-py's connection module
    from elasticsearch.connection import create_ssl_context

    # source_config here would be the DataHub Elasticsearch source's config object
    ssl_context = create_ssl_context()
    ssl_context.check_hostname = False
    ssl_context.verify_mode = ssl.CERT_NONE

    es_client = Elasticsearch(
        source_config.host,
        http_auth=source_config.http_auth,
        url_prefix=source_config.url_prefix,
        ssl_context=ssl_context,
    )
  • r

    red-pizza-28006

    05/05/2022, 5:00 PM
    I need a little help. I am trying to read column-level tags for one of the datasets. Can someone point me to the code for how to do it? Using the existing example, I run into this issue:
    Copy code
    AttributeError: 'DataHubGraph' object has no attribute 'get_aspect_v2'
    and I cannot find EditableSchemaMetadataClass either. I am on version 0.8.33.
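    A hedged sketch, assuming a newer acryl-datahub than 0.8.33 where DataHubGraph.get_aspect_v2 and EditableSchemaMetadataClass are available (on 0.8.33 those names may simply not exist yet, which would explain the error). The dataset URN and server are placeholders, and this only covers tags added via the UI, which live in editableSchemaMetadata:
    Copy code
    # Hedged sketch: read column-level tags from the editableSchemaMetadata aspect.
    # Assumes an acryl-datahub version where DataHubGraph.get_aspect_v2 and
    # EditableSchemaMetadataClass exist; the dataset URN and server are placeholders.
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import EditableSchemaMetadataClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"
    aspect = graph.get_aspect_v2(
        entity_urn=dataset_urn,
        aspect="editableSchemaMetadata",
        aspect_type=EditableSchemaMetadataClass,
    )

    if aspect:
        for field in aspect.editableSchemaFieldInfo:
            tags = [t.tag for t in field.globalTags.tags] if field.globalTags else []
            print(field.fieldPath, tags)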
  • l

    lemon-terabyte-66903

    05/05/2022, 8:12 PM
    Hi, I am trying to update field metadata via the editableSchemaMetadata aspect using the Python emitter. It fails with:
    Copy code
    datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: org.apache.kafka.common.errors.SerializationException: Error serializing Avro message
    But in the UI, I can see the correct change.
  • b

    billowy-refrigerator-34936

    05/06/2022, 8:42 AM
    Hi, in my case I want to ingest only the latest documents from MongoDB, e.g. the latest 10000 documents. How do I do that with DataHub ingestion?
  • a

    agreeable-army-26750

    05/06/2022, 11:06 AM
    Hi everyone! Can you help me with how to remove all added ingestion sources via the command line, with one command? Thanks in advance for the help!
  • a

    acoustic-quill-54426

    05/06/2022, 11:16 AM
    👋 I've found with the new blame view that all of our BigQuery sources have produced a new version without any change in the schema. The issue seems to be related to structs and arrays: the blame view reports a v1.0.0 in the struct, but all fields within it are on v0.0.0. I have not found the same behaviour with other sources (redshift, tableau)
  • g

    gifted-bird-57147

    05/06/2022, 12:43 PM
    I'm looking at the ingestion of Business Glossary terms, following the example from the documentation. In the example YAML the Account term has properties for term_source, source_ref and source_url, but I don't see those properties showing up in the UI? The custom property for Confidential does show up under Properties. Should the source properties appear under custom properties as well, or is there some other configuration error in the UI? I'm running v0.8.34.
  • m

    millions-sundown-65420

    05/06/2022, 3:39 PM
    Hi team. I am setting up lineage between my Mongo database's upstream and downstream entities. I am using 'make_lineage_mce' and DatahubKafkaEmitter's emit_mce_async to emit the lineage event via Kafka. I first set upstream and downstream the wrong way round, and I would like to delete the existing lineage for those entities. May I know how to delete that lineage?
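    A hedged sketch of one way to undo it: upstreamLineage is an aspect of the downstream dataset, so re-emitting it replaces the stored value, and emitting an empty upstream list against the dataset that was wired backwards should clear the reversed edge. A REST emitter is used here for brevity; URNs and server are placeholders:
    Copy code
    # Hedged sketch: overwrite the reversed lineage, then re-emit it in the intended direction.
    # Assumes re-emitting upstreamLineage replaces the stored aspect; URNs and server are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")

    upstream = make_dataset_urn("mongodb", "mydb.source_collection")
    downstream = make_dataset_urn("mongodb", "mydb.derived_collection")

    # 1. Clear the edge that was created the wrong way round.
    emitter.emit_mce(make_lineage_mce(upstream_urns=[], downstream_urn=upstream))

    # 2. Emit the lineage in the intended direction.
    emitter.emit_mce(make_lineage_mce(upstream_urns=[upstream], downstream_urn=downstream))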
  • m

    millions-sundown-65420

    05/06/2022, 6:41 PM
    Hi team. I am using the datahub.ingestion.run.pipeline.Pipeline.create function with source, sink & transformers. The source is of mongodb type and the sink is 'datahub-kafka'. When I do a pipeline.run(), metadata for ALL the databases in my source is emitted to DataHub. Is it possible to emit metadata only for a specific source database rather than all of them? I need to execute this pipeline for every data change in my source database. Thanks.
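    A hedged sketch, assuming the mongodb source honours database_pattern / collection_pattern (AllowDenyPattern) so a run can be limited to one database; connection details and names are placeholders:
    Copy code
    # Hedged sketch: restrict a mongodb ingestion run to a single database via database_pattern.
    # Assumes the mongodb source supports AllowDenyPattern filters; all values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                "config": {
                    "connect_uri": "mongodb://localhost:27017",
                    "database_pattern": {"allow": ["^my_database$"]},
                },
            },
            "sink": {
                "type": "datahub-kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    }
                },
            },
        }
    )
    pipeline.run()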
  • c

    cuddly-arm-8412

    05/07/2022, 12:45 AM
    Hi team, I want to know: if I extend the metadata model myself, do I need to modify the metadata-ingestion code to populate my custom model, or how should I adapt it?
  • m

    millions-sundown-65420

    05/08/2022, 9:52 AM
    Hi team. I can see transformers to add tags, terms & owners. Is there a transformer to add a domain to a dataset?
  • s

    sticky-dawn-95000

    05/08/2022, 12:01 PM
    Hi. I have a problem ingesting metadata from an Oracle database. It cannot collect schema information from Oracle: 'missing column information'. Here is my log:
    Copy code
    [mjlee@tkgkdc01:datahub-ingestion]$ datahub ingest -c ./oracle_to_datahub.yml --preview --preview-workunits=1000 --dry-run  
    [2022-05-04 16:48:49,925] INFO   {datahub.cli.ingest_cli:88} - DataHub CLI version: 0.8.32.2
    [2022-05-04 16:48:49,937] INFO   {datahub.ingestion.sink.datahub_rest:60} - Setting gms config
    [2022-05-04 16:48:53,814] INFO   {datahub.cli.ingest_cli:104} - Starting metadata ingestion
    /usr/local/lib64/python3.6/site-packages/sqlalchemy/dialects/oracle/base.py:1421: SAWarning: Oracle version (19, 12, 0, 0, 0) is known to have a maximum identifier length of 128, rather than the historical default of 30. SQLAlchemy 1.4 will use 128 for this database; please set max_identifier_length=128 in create_engine() in order to test the application with this new length, or set to 30 in order to assure that 30 continues to be used. In particular, pay close attention to the behavior of database migrations as dynamically generated names may change. See the section 'Max Identifier Lengths' in the SQLAlchemy Oracle dialect documentation for background.
     % ((self.server_version_info,))
    /usr/local/lib64/python3.6/site-packages/sqlalchemy/dialects/oracle/base.py:1776: SAWarning: Did not recognize type 'ROWID' of column 'head_rowid'
     % (coltype, colname)
    /usr/local/lib64/python3.6/site-packages/sqlalchemy/dialects/oracle/base.py:1776: SAWarning: Did not recognize type 'UROWID' of column 'head_rowid'
     % (coltype, colname)
    [2022-05-04 16:48:57,984] INFO   {datahub.cli.ingest_cli:106} - Finished metadata ingestion
    
    Source (oracle) report:
    {'cli_entry_location': '/fshome/mjlee/.local/lib/python3.6/site-packages/datahub/__init__.py',
     'cli_version': '0.8.32.2',
     'entities_profiled': 0,
     'failures': {},
     'filtered': [],
     'os_details': 'Linux-4.18.0-193.el8.x86_64-x86_64-with-redhat-8.2-Ootpa',
     'py_exec_path': '/usr/bin/python3',
     'py_version': '3.6.8 (default, Dec 5 2019, 15:45:45) \n[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]',
     'query_combiner': None,
     'soft_deleted_stale_entities': [],
     'tables_scanned': 155,
     'views_scanned': 43,
     'warnings': {'audsys.aud$unified': ['missing column information'],
           'sys.aq$_alert_qt_g': ['missing column information'],
           'sys.aq$_alert_qt_h': ['missing column information'],
           'sys.aq$_alert_qt_i': ['missing column information'],
           'sys.aq$_alert_qt_t': ['missing column information'],
           …
             'container-urn:li:container:0c832d6cfb642ca7e447946a0f1d88b6-to-urn:li:dataset:(urn:li:dataPlatform:oracle,sys.aq$_kupc$datapump_quetab_1_i,PROD)',
             'sys.aq$_kupc$datapump_quetab_1_i'],
     'workunits_produced': 759}
    Sink (datahub-rest) report:
    {'downstream_end_time': None,
     'downstream_start_time': None,
     'downstream_total_latency_in_seconds': None,
     'failures': [],
     'gms_version': 'v0.8.29',
     'records_written': 0,
     'warnings': []}
    
    Pipeline finished with warnings
    As a result, in my DataHub web UI there is no information about the table schemas. I only have permission to read these Oracle system tables: all_check_constraints, all_col_comments, all_cons_columns, all_constraints, all_ind_columns, all_indexes, all_sequences, all_synonyms, all_tab_cols, all_tab_columns, all_tab_comments, all_tab_identity_cols, all_tables, all_users, all_views, dual, nls_session_parameters, user_db_links. Do I need permission to read more system tables? How do I figure that out? Thanks in advance for the help.
  • c

    cool-architect-34612

    05/08/2022, 11:39 PM
    Hi, is there any way to ingest silently (e.g. verbose=0)? There are too many logs.