# ingestion
  • f

    fresh-battery-23937

    05/03/2022, 10:16 PM
    hi, I have the Docker quickstart up and running OK, but I am unable to ingest from a Snowflake datasource. All images show healthy. Which container is responsible for ingestion, so I can check its logs?
  • b

    best-umbrella-24804

    05/03/2022, 11:56 PM
    Hi, I've got a Glue ingestion job that looks like this, and it's currently throwing this strange error.
  • b

    best-umbrella-24804

    05/04/2022, 12:40 AM
    Anybody ever implement the spark-agent for a Glue job? https://datahubproject.io/docs/metadata-integration/java/spark-lineage/#configuring-spark-agent Some sample code would be amazing.
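    For reference, a hedged sketch of wiring the spark-lineage listener into a Glue (PySpark) job; the package version, listener class and GMS URL are assumptions to verify against the spark-lineage docs, and in Glue these settings are usually supplied as --conf job parameters rather than set in code:
    Copy code
    # Hedged sketch: enable the DataHub spark-lineage listener for a Glue (PySpark) job.
    # Verify the package version, listener class and server URL against the docs; <gms-host> is a placeholder.
    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        .set("spark.jars.packages", "io.acryl:datahub-spark-lineage:0.8.32")
        .set("spark.extraListeners", "datahub.spark.DatahubSparkListener")
        .set("spark.datahub.rest.server", "http://<gms-host>:8080")
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    # ... the rest of the Glue job runs as usual; the listener reports lineage for each Spark action.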
  • b

    billowy-refrigerator-34936

    05/04/2022, 1:20 AM
    Hello everybody, I'm new to DataHub. I saw the term pull-based ingestion; could anyone explain how it works? I also have a case: an operational system running separately from the DataHub system, where the only way to reach its data is through Kafka (every action in the operational system is sent to a Kafka topic). In this case, how do we extract metadata from the Kafka messages and capture schema changes? Each message in the Kafka topic is a full document of a record in the operational system's database.
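    For reference, "pull-based" ingestion means DataHub connects to the source on a schedule and pulls metadata via a recipe, rather than the source pushing events. A minimal sketch of a Kafka recipe run programmatically, assuming the topic schemas live in a Confluent-compatible schema registry so schema changes are picked up on each run (hostnames and server are placeholders):
    Copy code
    # Hedged sketch: pull-based ingestion of Kafka topic metadata.
    # Assumes a Confluent-compatible schema registry holds the topic schemas; hostnames are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()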
  • b

    best-umbrella-24804

    05/04/2022, 2:04 AM
    Hi, the following Glue recipe works when extract_transform is set to false. Removing the extract_transform config or setting it to true gives the following error. Any advice?
  • m

    mammoth-fountain-32989

    05/04/2022, 7:57 AM
    Hi, this has been asked by someone earlier, but I couldn't find a response, so I'm reposting. I see that partitions of partitioned Postgres tables are being ingested as separate objects. Is there any way or setting to avoid this (like ignore_partitions, etc.)? Thanks
  • d

    dazzling-queen-76396

    05/04/2022, 8:10 AM
    Hey! When I try to exclude some datasets in bigquery-usage ingestion I get this error
    Copy code
    '1 validation error for BigQueryUsageConfig\n'
               'dataset_pattern\n'
               '  extra fields not permitted (type=value_error.extra)\n',
    Here is my recipe
    Copy code
    source:
        type: bigquery-usage
        config:
            projects:
                - project1
                - project2
            credential:
                project_id: ''
                private_key_id: ''
                private_key: ''
                client_email: ''
                client_id: ''
            top_n_queries: 10
            include_operational_stats: true
            max_query_duration: 45
            dataset_pattern:
                deny:
                    - '.*_raw'
                    - '.*_dev'
                    - '.*_staging.*'
    sink:
        type: datahub-rest
        config:
            server: ''
    DataHub version is 0.8.32. Could you help me figure out the problem?
  • c

    cold-hydrogen-10513

    05/04/2022, 11:49 AM
    hi, I can't find the required permissions for the Snowflake user on the docs page anymore: https://datahubproject.io/docs/generated/ingestion/sources/snowflake. Previously there was a list of permissions that should be granted to the user DataHub uses to ingest metadata, lineage, usage, etc. Could you please add it back, or send me a link to another page that contains that info? I'm asking because I started getting this error in my DAG (I'm using an Airflow 1.12 DAG to import metadata; DataHub version is 0.8.34.1):
    Copy code
    [2022-05-04 11:34:53,096] {{sql_common.py:496}} WARNING - lineage => Extracting lineage from Snowflake failed.Please check your premissions. Continuing...
  • a

    agreeable-army-26750

    05/04/2022, 1:15 PM
    Hi everyone! I would like to develop a feature that, based on specific user inputs (via a form), generates a YAML configuration file for ingesting glossary terms into DataHub. Maybe I am wrong, but from the provided examples, the ingestible file is separate from the actual glossary configuration: there is one file that references the real configuration (picture 1), and the real glossary configuration itself (picture 2). The business_glossary.yml cannot be used as is (picture 2), so we need a wrapper config (picture 1) that points to the file in order to run it. I would like to merge these two config files into one (e.g. adding a new attribute under file, like config, that actually contains the configuration shown in picture 2). Is there an easy way to enable that? I tried to mess with the Python configuration so that it goes through validation with the new field, but lots of files and validations failed afterwards. Can you point me in the right direction? Is it even possible with minimal coding effort? Thank you in advance for helping 🙂
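    For reference, the wrapper recipe ("picture 1") is an ordinary ingestion recipe whose source just points at the glossary file, so one alternative to merging the two files is to have the form write business_glossary.yml and drive the wrapper programmatically. A minimal sketch, with the file path and server as placeholders:
    Copy code
    # Hedged sketch: run the business-glossary wrapper recipe from Python so a generated
    # business_glossary.yml can be ingested without a hand-written wrapper file.
    # File path and GMS server are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {"file": "./business_glossary.yml"},
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()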
  • l

    lemon-terabyte-66903

    05/04/2022, 1:41 PM
    How do I hide/mask sample values of sensitive fields in the Profiling tab?
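    One hedged option, assuming the GE-based profiler exposes the include_field_sample_values flag in your version: it turns off sample values for the whole run rather than per field. Connection details below are placeholders:
    Copy code
    # Hedged sketch: disable sample values in profiling output entirely.
    # Assumes the profiling config supports include_field_sample_values; per-field masking
    # would need a different mechanism. Connection details are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "postgres",  # any SQL source with profiling support
                "config": {
                    "host_port": "db-host:5432",
                    "database": "db_name",
                    "username": "user",
                    "password": "pass",
                    "profiling": {
                        "enabled": True,
                        "include_field_sample_values": False,
                    },
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()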
  • m

    mammoth-fall-12031

    05/04/2022, 2:24 PM
    I have ingested an MSSQL database into our DataHub setup; however, I don't see lineage data being picked up, although it's mentioned that it will be picked up by default. Below is the sample YAML I'm using for ingestion.
    Copy code
    source:
      type: mssql
      config:
        username: username
        password: password
        database: db_name
        host_port: host_with_port
        profiling:
            enabled: true
    
    # see https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub for complete documentation
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    Anything I need to add to explicitly enable lineage?
  • n

    nutritious-bird-77396

    05/04/2022, 8:11 PM
    Team... Is it possible to trigger ingestion via the API? I found the GraphQL endpoint createIngestionExecutionRequest, but it's not clear how it would fetch the ingestion recipe. Any thoughts?
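    For reference, a hedged sketch of how this appears intended to work: the recipe is stored on the ingestion source entity itself, and createIngestionExecutionRequest only references that source by URN, so the mutation does not carry the recipe. Server URL, token and URN below are placeholders:
    Copy code
    # Hedged sketch: trigger a UI-managed ingestion source via the GraphQL API.
    # Assumes the mutation input takes the ingestion source URN and the stored recipe is used;
    # server URL, access token and URN are placeholders.
    import requests

    MUTATION = """
    mutation runIngestion($urn: String!) {
      createIngestionExecutionRequest(input: { ingestionSourceUrn: $urn })
    }
    """

    response = requests.post(
        "http://localhost:8080/api/graphql",
        headers={"Authorization": "Bearer <personal-access-token>"},
        json={
            "query": MUTATION,
            "variables": {"urn": "urn:li:dataHubIngestionSource:<source-id>"},
        },
    )
    response.raise_for_status()
    print(response.json())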
  • m

    millions-waiter-49836

    05/04/2022, 9:19 PM
    Hi team, about LookML ingestion: can I ask why this is reported as a failure rather than a warning: https://github.com/datahub-project/datahub/blob/075d19ef166177ececfbb39796de4721bd[…]e9dc1/metadata-ingestion/src/datahub/ingestion/source/lookml.py Sometimes when a view file is included in a LookML model, it may not physically exist in the base_folder, as the LookML repo can be quite messy in production… I wonder if we can downgrade the failure to a warning so the LookML ingestion job won't show a 'fail' status every time.
  • o

    orange-coat-2879

    05/04/2022, 11:25 PM
    Hi team, for this error, does it mean I just need to increase device space? I am doing everything on a VM.
  • r

    rich-policeman-92383

    05/05/2022, 12:49 PM
    While specifying the Hive queue as mentioned here: https://datahubspace.slack.com/archives/CUMUWQU66/p1640608353477600 I am getting the below error: TypeError: __init__() got an unexpected keyword argument 'tez.queue.name'
    Copy code
    File "/IngestionRecipies/dhubv08_19/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 508, in connect
        return self.dbapi.connect(*cargs, **cparams)
    File "/IngestionRecipies/dhubv08_19/lib64/python3.6/site-packages/pyhive/hive.py", line 126, in connect
        return Connection(*args, **kwargs)
    
    TypeError: __init__() got an unexpected keyword argument 'tez.queue.name'
    [2022-05-05 18:14:40,880] INFO     {datahub.entrypoints:162} - DataHub CLI version: 0.8.31 at /IngestionRecipies/dhubv08_19/lib64/python3.6/site-packages/datahub/__init__.py
    [2022-05-05 18:14:40,880] INFO     {datahub.entrypoints:165} - Python version: 3.6.8 (default, Aug 13 2020, 07:46:32) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] at /IngestionRecipies/dhubv08_19/bin/python3 on Linux-3.10.0-1160.25.1.el7.x86_64-x86_64-with-redhat-7.9-Maipo
    [2022-05-05 18:14:40,880] INFO     {datahub.entrypoints:167} - GMS config {'models': {}, 'versions': {'linkedin/datahub': {'version': 'v0.8.31', 'commit': '2f078c981c86b72145eebf621230ffd445948ef6'}}, 'managedIngestion': {'defaultCliVersion': '0.8.26.6', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'retention': 'true', 'noCode': 'true'}
    YML
    Copy code
    ---
    source:
      type: hive
      config:
        host_port: hive:10000
        env: "PROD"
        database: databaseName
        table_pattern:
          allow:
            - datasetname$
        options:
          connect_args: {'auth': 'KERBEROS','kerberos_service_name': 'hive', 'tez.queue.name': 'root.myqueue'}
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - datasetname
    sink:
      type: "datahub-rest"
      config:
        server: "<https://didatahub.airtel.com:8080>"
  • s

    straight-telephone-84434

    05/05/2022, 2:51 PM
    I am trying to ingest some BigQuery data, but something is not working in the libraries. This is my YAML:
    Copy code
    source:
      type: bigquery
      config:
        project_id: "internal-project-bigquery"
        options:
          credentials_path: "./key_file.json"
        table_pattern:
          allow:
          # Allow only one table
          - "bigquery-public-data.chicago_crime.crime"
    sink:
      # sink data
      type: "datahub-rest"
      config:
        server: # "sink_server"
    This is the error I am getting:
    Copy code
    AttributeError: module 'pybigquery.sqlalchemy_bigquery' has no attribute 'dialect'
    acryl-datahub, version 0.8.28.1
  • w

    worried-motherboard-80036

    05/05/2022, 5:00 PM
    Is there a way to pass an SSL context, for example? Something like this would work:
    Copy code
    import ssl

    from elasticsearch import Elasticsearch
    # create_ssl_context is commonly imported from elasticsearch-py's connection module
    from elasticsearch.connection import create_ssl_context

    # source_config here would be the DataHub Elasticsearch source's config object
    ssl_context = create_ssl_context()
    ssl_context.check_hostname = False
    ssl_context.verify_mode = ssl.CERT_NONE

    es_client = Elasticsearch(
        source_config.host,
        http_auth=source_config.http_auth,
        url_prefix=source_config.url_prefix,
        ssl_context=ssl_context,
    )
  • r

    red-pizza-28006

    05/05/2022, 5:00 PM
    I need a little help. I am trying to read column-level tags for one of the datasets. Can someone point me to the code for how to do it? Using the existing example, I run into this issue:
    Copy code
    AttributeError: 'DataHubGraph' object has no attribute 'get_aspect_v2'
    and I cannot find EditableSchemaMetadataClass either. I am on version 0.8.33.
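    A hedged sketch, assuming a newer acryl-datahub than 0.8.33 where DataHubGraph.get_aspect_v2 and EditableSchemaMetadataClass are available (on 0.8.33 those names may simply not exist yet, which would explain the error). The dataset URN and server are placeholders, and this only covers tags added via the UI, which live in editableSchemaMetadata:
    Copy code
    # Hedged sketch: read column-level tags from the editableSchemaMetadata aspect.
    # Assumes an acryl-datahub version where DataHubGraph.get_aspect_v2 and
    # EditableSchemaMetadataClass exist; the dataset URN and server are placeholders.
    from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
    from datahub.metadata.schema_classes import EditableSchemaMetadataClass

    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:snowflake,db.schema.table,PROD)"
    aspect = graph.get_aspect_v2(
        entity_urn=dataset_urn,
        aspect="editableSchemaMetadata",
        aspect_type=EditableSchemaMetadataClass,
    )

    if aspect:
        for field in aspect.editableSchemaFieldInfo:
            tags = [t.tag for t in field.globalTags.tags] if field.globalTags else []
            print(field.fieldPath, tags)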
  • l

    lemon-terabyte-66903

    05/05/2022, 8:12 PM
    Hi, I am trying to update field metadata via the editableSchemaMetadata aspect using the Python emitter. It fails with:
    Copy code
    datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS', {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException', 'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: org.apache.kafka.common.errors.SerializationException: Error serializing Avro message
    But in the UI, I can see the correct change.
  • b

    billowy-refrigerator-34936

    05/06/2022, 8:42 AM
    Hi, in my case I want to ingest only the latest documents from MongoDB, e.g. the latest 10000 documents. How do I do that with DataHub ingestion?
  • a

    agreeable-army-26750

    05/06/2022, 11:06 AM
    Hi everyone! Can you help me with how to remove all added ingestion sources via the command line, with one command? Thanks in advance for the help!
  • a

    acoustic-quill-54426

    05/06/2022, 11:16 AM
    👋 I've found with the new blame view that all of our BigQuery sources have produced a new version without any change in the schema. The issue seems to be related to structs and arrays: the blame view reports a v1.0.0 in the struct, but all fields within it are on v0.0.0. I have not found the same behaviour with other sources (redshift, tableau)
  • g

    gifted-bird-57147

    05/06/2022, 12:43 PM
    I'm looking at the ingestion of Business Glossary terms, following the example from the documentation. In the example YAML the Account term has properties for term_source, source_ref and source_url, but I don't see those properties showing up in the UI? The custom property for Confidential does show up under Properties. Should the source properties appear under custom properties as well, or is there some other configuration error in the UI? I'm running v0.8.34.
  • m

    millions-sundown-65420

    05/06/2022, 3:39 PM
    Hi team. I am setting up lineage between my Mongo database's upstream and downstream entities. I am using 'make_lineage_mce' and DatahubKafkaEmitter's emit_mce_async to emit the lineage event via Kafka. I first set upstream and downstream the wrong way round, and I would like to delete the existing lineage for those entities. May I know how to delete that lineage?
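    A hedged sketch of one way to undo it: upstreamLineage is an aspect of the downstream dataset, so re-emitting it replaces the stored value, and emitting an empty upstream list against the dataset that was wired backwards should clear the reversed edge. A REST emitter is used here for brevity; URNs and server are placeholders:
    Copy code
    # Hedged sketch: overwrite the reversed lineage, then re-emit it in the intended direction.
    # Assumes re-emitting upstreamLineage replaces the stored aspect; URNs and server are placeholders.
    from datahub.emitter.mce_builder import make_dataset_urn, make_lineage_mce
    from datahub.emitter.rest_emitter import DatahubRestEmitter

    emitter = DatahubRestEmitter("http://localhost:8080")

    upstream = make_dataset_urn("mongodb", "mydb.source_collection")
    downstream = make_dataset_urn("mongodb", "mydb.derived_collection")

    # 1. Clear the edge that was created the wrong way round.
    emitter.emit_mce(make_lineage_mce(upstream_urns=[], downstream_urn=upstream))

    # 2. Emit the lineage in the intended direction.
    emitter.emit_mce(make_lineage_mce(upstream_urns=[upstream], downstream_urn=downstream))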
  • m

    millions-sundown-65420

    05/06/2022, 6:41 PM
    Hi team. I am using the datahub.ingestion.run.pipeline.Pipeline.create function with source, sink & transformers. The source is of mongodb type and the sink is 'datahub-kafka'. When I do a pipeline.run(), metadata for ALL the databases in my source is emitted to DataHub. Is it possible to emit metadata only for a specific source database rather than all of them? I need to execute this pipeline for every data change in my source database. Thanks.
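    A hedged sketch, assuming the mongodb source honours database_pattern / collection_pattern (AllowDenyPattern) so a run can be limited to one database; connection details and names are placeholders:
    Copy code
    # Hedged sketch: restrict a mongodb ingestion run to a single database via database_pattern.
    # Assumes the mongodb source supports AllowDenyPattern filters; all values are placeholders.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "mongodb",
                "config": {
                    "connect_uri": "mongodb://localhost:27017",
                    "database_pattern": {"allow": ["^my_database$"]},
                },
            },
            "sink": {
                "type": "datahub-kafka",
                "config": {
                    "connection": {
                        "bootstrap": "broker:9092",
                        "schema_registry_url": "http://schema-registry:8081",
                    }
                },
            },
        }
    )
    pipeline.run()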
  • c

    cuddly-arm-8412

    05/07/2022, 12:45 AM
    Hi team, I want to know: if I extend the metadata model myself, do I need to modify the metadata-ingestion code to populate my custom model, or how should I adapt it?
  • m

    millions-sundown-65420

    05/08/2022, 9:52 AM
    Hi team. I can see transformers to add tags, terms & owners. Is there a transformer to add a domain to a dataset?
  • s

    sticky-dawn-95000

    05/08/2022, 12:01 PM
    Hi. I have a problem ingesting metadata from an Oracle database. It cannot collect schema information from Oracle: 'missing column information'. Here is my log:
    Copy code
    [mjlee@tkgkdc01:datahub-ingestion]$ datahub ingest -c ./oracle_to_datahub.yml --preview --preview-workunits=1000 --dry-run  
    [2022-05-04 16:48:49,925] INFO   {datahub.cli.ingest_cli:88} - DataHub CLI version: 0.8.32.2
    [2022-05-04 16:48:49,937] INFO   {datahub.ingestion.sink.datahub_rest:60} - Setting gms config
    [2022-05-04 16:48:53,814] INFO   {datahub.cli.ingest_cli:104} - Starting metadata ingestion
    /usr/local/lib64/python3.6/site-packages/sqlalchemy/dialects/oracle/base.py:1421: SAWarning: Oracle version (19, 12, 0, 0, 0) is known to have a maximum identifier length of 128, rather than the historical default of 30. SQLAlchemy 1.4 will use 128 for this database; please set max_identifier_length=128 in create_engine() in order to test the application with this new length, or set to 30 in order to assure that 30 continues to be used. In particular, pay close attention to the behavior of database migrations as dynamically generated names may change. See the section 'Max Identifier Lengths' in the SQLAlchemy Oracle dialect documentation for background.
     % ((self.server_version_info,))
    /usr/local/lib64/python3.6/site-packages/sqlalchemy/dialects/oracle/base.py:1776: SAWarning: Did not recognize type 'ROWID' of column 'head_rowid'
     % (coltype, colname)
    /usr/local/lib64/python3.6/site-packages/sqlalchemy/dialects/oracle/base.py:1776: SAWarning: Did not recognize type 'UROWID' of column 'head_rowid'
     % (coltype, colname)
    [2022-05-04 16:48:57,984] INFO   {datahub.cli.ingest_cli:106} - Finished metadata ingestion
    
    Source (oracle) report:
    {'cli_entry_location': '/fshome/mjlee/.local/lib/python3.6/site-packages/datahub/__init__.py',
     'cli_version': '0.8.32.2',
     'entities_profiled': 0,
     'failures': {},
     'filtered': [],
     'os_details': 'Linux-4.18.0-193.el8.x86_64-x86_64-with-redhat-8.2-Ootpa',
     'py_exec_path': '/usr/bin/python3',
     'py_version': '3.6.8 (default, Dec 5 2019, 15:45:45) \n[GCC 8.3.1 20191121 (Red Hat 8.3.1-5)]',
     'query_combiner': None,
     'soft_deleted_stale_entities': [],
     'tables_scanned': 155,
     'views_scanned': 43,
     'warnings': {'audsys.aud$unified': ['missing column information'],
           'sys.aq$_alert_qt_g': ['missing column information'],
           'sys.aq$_alert_qt_h': ['missing column information'],
           'sys.aq$_alert_qt_i': ['missing column information'],
           'sys.aq$_alert_qt_t': ['missing column information'],
           …
             'container-urn:li:container:0c832d6cfb642ca7e447946a0f1d88b6-to-urn:li:dataset:(urn:li:dataPlatform:oracle,sys.aq$_kupc$datapump_quetab_1_i,PROD)',
             'sys.aq$_kupc$datapump_quetab_1_i'],
     'workunits_produced': 759}
    Sink (datahub-rest) report:
    {'downstream_end_time': None,
     'downstream_start_time': None,
     'downstream_total_latency_in_seconds': None,
     'failures': [],
     'gms_version': 'v0.8.29',
     'records_written': 0,
     'warnings': []}
    
    Pipeline finished with warnings
    As a result, in my DataHub web UI there is no information about the table schemas. I only have permission to read these Oracle system tables: all_check_constraints, all_col_comments, all_cons_columns, all_constraints, all_ind_columns, all_indexes, all_sequences, all_synonyms, all_tab_cols, all_tab_columns, all_tab_comments, all_tab_identity_cols, all_tables, all_users, all_views, dual, nls_session_parameters, user_db_links. Do I need permission to read more system tables? How do I figure that out? Thanks in advance for the help.
  • c

    cool-architect-34612

    05/08/2022, 11:39 PM
    Hi, is there any way to ingest silently (e.g. verbose=0)? There are too many logs.