# ingestion
  • b

    brief-insurance-68141

    09/23/2021, 9:55 PM
    thrift.transport.TTransport.TTransportException: Could not start SASL: b’Error in sasl_client_start (-4) SASL(-4): no mechanism available: No worthy mechs found’
    m
    m
    +3
    • 6
    • 31
  • a

    adamant-pharmacist-61996

    09/24/2021, 2:38 AM
    hey everyone! 👋 In our Airflow instance we run subdags within one main DAG in order to manage dependencies between workflows. We’re noticing that this results in broken lineage between the parent DAG and the sub-DAG. Has anyone noticed this before and found a workaround?
    h
    l
    • 3
    • 3
  • b

    brave-market-65632

    09/24/2021, 5:03 AM
    Business glossary question: Thanks for the demo and PR. It was great! In the demo video, there was a subtle point about keeping the glossary configuration in a single YAML file vs. splitting it into multiple files, as long as the tree structure is preserved. This means if the following is the structure in one file
    Copy code
    node 1
    		> term 1
    		> term 2
    		node 2
    			> term a
    			> term b
    and if one were to introduce a new node and terms collection at an arbitrary location in the tree, the file should be defined like this. Is this a fair assumption?
    Copy code
    node 1
    		> term 1
    		> term 2
    		node 2
    			node 3
    				> term c
    				> term d
    This did work for me. Wondering if I'm missing something here. This meant that I had to repeat the name and description configs for the nodes. One could write an abstraction to generate the yaml file. Would it make sense to simply have a parent_urn or something like that in the nodes config to make it canonical to attach a node and term collection at any point in the tree? Something like
    Copy code
    node 3
    				parent_node: node 1.node 2 or simply node 2
    				> term c
    				> term d
    Thanks!
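    A rough sketch of how the nested layout above maps onto the glossary YAML's nodes/terms fields, with field names based on the example business_glossary.yml (an illustration only, not the authoritative schema):
    Copy code
    nodes:
      - name: node 1
        description: Top-level node
        terms:
          - name: term 1
            description: First term under node 1
          - name: term 2
            description: Second term under node 1
        nodes:
          - name: node 2
            description: Nested node, repeated so the tree structure is preserved
            terms:
              - name: term a
                description: Example nested term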
    m
    • 2
    • 2
  • m

    mysterious-monkey-71931

    09/24/2021, 5:41 AM
    Hello. We already use Debezium for CDC from different data sources (MySQL, Postgres, MSSQL, ...). Can we reuse the Kafka schema registry or
    dbhistory
    topics for DataHub?
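    For the schema-registry part, the kafka source can point at an existing registry; a minimal sketch, with placeholder broker and registry addresses (check the kafka source docs for the full option list):
    Copy code
    source:
      type: kafka
      config:
        connection:
          bootstrap: "broker:9092"                           # placeholder: existing Kafka cluster
          schema_registry_url: "http://schema-registry:8081" # placeholder: existing schema registry
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"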
    m
    • 2
    • 5
  • w

    witty-keyboard-20400

    09/24/2021, 8:37 AM
    I'm new to DataHub; it looks very promising for metadata management. I want to take the "file to DataHub (REST)" path with a YAML config. However, I'm not sure what the different fields mean. Is there any doc with an example to get me up to speed here? I'd really appreciate the help.
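    For context, the recipe for that path is small; a minimal sketch with placeholder filename and server values looks like this:
    Copy code
    source:
      type: "file"
      config:
        filename: "./bootstrap_mce.json"  # path to the MCE JSON file to ingest
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"   # address of the DataHub GMS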
    b
    w
    • 3
    • 7
  • a

    adventurous-scooter-52064

    09/25/2021, 6:53 AM
    Hi, I’m using AWS Glue Schema Registry, and I’m wondering what I should put for my datahub-kafka sink’s
    connection.schema_registry_url
    ? 😢 https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub#config-details-1
    e
    • 2
    • 1
  • n

    nice-planet-17111

    09/27/2021, 1:31 AM
    Hello, I'm a noob to DataHub. I've deployed DataHub on GKE, and I'm trying to ingest BigQuery metadata via
    datahub-rest
    . The app (DataHub) and BigQuery are in the same private project. When I try the console sink or the file sink, it succeeds without error. However, the datahub-rest sink fails with
    ConnectionError
    ☹️ Is there something I'm missing? Here's my recipe...
    Copy code
    source:
      type: bigquery
      config:
        project_id: <my_project_id>
    
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    Error message:
    Copy code
    ConnectionError: HTTPConnectionPool(host='localhost', port=8080): Max retries exceeded with url: /config (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x10ba85d00>: Failed to establish a new connection: [Errno 61] Connection refused'
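    For what it's worth, the ConnectionError above just means nothing is listening on localhost:8080 where the recipe runs; with DataHub on GKE the sink has to point at an address where GMS is actually reachable (a port-forward or an exposed service). A sketch under that assumption, with a placeholder host name:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        # Placeholder: replace with the address where GMS is reachable from the
        # machine running "datahub ingest", e.g. after
        #   kubectl port-forward svc/<release>-datahub-gms 8080:8080
        # or via an exposed LoadBalancer/Ingress endpoint.
        server: "http://<gms-host>:8080"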
    m
    • 2
    • 3
  • n

    nice-planet-17111

    09/27/2021, 8:32 AM
    Hi, another newbie question here 😂 Is there a way to automatically upsert metadata, detecting only the changed parts? I'm trying to ingest BigQuery metadata via datahub-rest. Since several people are using the same project, it is hard to know exactly which part of the dataset was modified and when. What I want is to update only the changed parts even though I don't define anything specific (like a particular table) in the recipe, using Airflow or similar. Optimally, whenever a change occurs in the data source, I want DataHub to automatically upsert that change. Is there a way I can do this? 🙂
    s
    l
    l
    • 4
    • 9
  • b

    bumpy-activity-74405

    09/27/2021, 11:11 AM
    Hey, how do you deal with datasets deleted in the source after they’ve already been ingested by a previous run? I am trying to figure out how to automate the process. I was thinking of running some job that would compare what is already ingested to what I would be ingesting, and sending MCEs for the diff items with a status aspect where
    removed=true
    . Curious to know if anyone has had success with this or any other approach.
    s
    l
    • 3
    • 3
  • s

    stocky-noon-61140

    09/27/2021, 12:56 PM
    Hi everyone - I'm looking for a description of the business glossary file format. In particular, I would like to know which relationship types I can specify among business terms. The example yml file provided only contains the relationship elements "contains" and "inherits" (https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml). My goal is to specify, e.g., that "Glossary Term A" RELATES TO "Glossary Term B".
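    For reference, the two relationship elements mentioned above appear on terms in that example file roughly like this (a sketch only; exact placement and naming should be checked against business_glossary.yml, and a generic RELATES TO element is not part of it):
    Copy code
    nodes:
      - name: ExampleNode
        description: Example glossary node
        terms:
          - name: GlossaryTermA
            description: Example term
            inherits:
              - ExampleNode.GlossaryTermB   # "inherits" relationship to another term
            contains:
              - ExampleNode.GlossaryTermC   # "contains" relationship to another term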
    l
    g
    • 3
    • 11
  • b

    bland-orange-13353

    09/27/2021, 7:46 PM
    This message was deleted.
    m
    a
    • 3
    • 6
  • a

    astonishing-lunch-91223

    09/27/2021, 8:44 PM
    OK, let me try this one more time… I have the following metadata ingestion config that I’m trying to run via the
    linkedin/datahub-ingestion
    container (
    ingest -c /workspace/data_recipe.yml
    ):
    Copy code
    source:
      type: "file"
      config:
        filename: "/workspace/bootstrap_mce.json"
    sink:
      type: "datahub-rest"
      config:
        server: '<http://localhost:8080>'
    and I’m using this `bootstrap_mce.json`: https://github.com/linkedin/datahub/blob/master/metadata-ingestion/examples/mce_files/bootstrap_mce.json with DataHub version v0.8.14. Any ideas why I’m getting the errors from the attached log? Basically I’m hitting that
    No root resource defined for path '/corpUsers'
    issue again.
    err.log
    b
    • 2
    • 6
  • a

    adventurous-scooter-52064

    09/28/2021, 6:24 AM
    Is anyone here using AWS Athena with SQL profiling? How are you using it? We just can’t find a way around SQL profiling on big tables in AWS Athena 😞
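    One hedged sketch of reining profiling in on an Athena recipe: scope which tables get profiled with a pattern. The option names below are based on the generic SQL profiling config and may differ by version, so treat this as an illustration:
    Copy code
    source:
      type: athena
      config:
        # ... connection options ...
        profiling:
          enabled: true
        profile_pattern:
          allow:
            - 'my_db\.small_table'   # placeholder: only profile tables known to be small
          deny:
            - 'my_db\.big_.*'        # placeholder: skip known-large tables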
    s
    l
    +2
    • 5
    • 7
  • n

    numerous-cricket-19689

    09/28/2021, 6:44 PM
    I am a newbie in this space, and one question I had is: how can I implement RDBMS schema ingestion using a push model? For example, the https://datahubproject.io/docs/metadata-ingestion document talks about how it can scan a MySQL database and publish all the databases, tables, and schemas. Can it generate a change log (e.g. column x was added to table A), or will I have to implement something myself, e.g. listen to schema changes in MySQL using a tool like Debezium and, when I receive a schema-change event, use it to publish to DataHub? I really like the DataHub project, so thank you for creating this wonderful technology. Thanks!
    h
    m
    • 3
    • 4
  • r

    rough-eye-60206

    09/28/2021, 8:52 PM
    Hello, I am new to DataHub and I was trying to ingest data (a local file) to http://localhost:9002/, but I am getting the following error. Can someone please help me?
    Copy code
    File "/Users/vn0d5ac/Library/Python/3.7/lib/python/site-packages/datahub/emitter/rest_emitter.py", line 94, in test_connection
        f"This version of {__package_name__} requires GMS v0.8.0 or higher"
    
    ValueError: This version of acryl-datahub requires GMS v0.8.0 or higher
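    One detail worth checking for this setup: http://localhost:9002 is the DataHub frontend UI in the quickstart, while the datahub-rest sink talks to GMS, which the quickstart exposes on port 8080. A minimal sink sketch under that assumption:
    Copy code
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"   # GMS endpoint (quickstart default), not the UI at :9002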
    g
    b
    • 3
    • 12
  • b

    brief-insurance-68141

    09/28/2021, 11:04 PM
    Looks like the cron job in DataHub does not remove tables that were dropped in the source database.
    m
    • 2
    • 6
  • s

    sparse-energy-27188

    09/29/2021, 2:02 AM
    Hey, I've been trying to use the datahub-ingestion Docker image to ingest and kept getting a lot of weird errors about the recipe being invalid in ways that contradicted the documentation. I just figured out that the "latest" tag on the image is 4 months old. It would be good if someone updated the latest tag to v0.8.14 so others don't experience the same problems.
    e
    b
    • 3
    • 3
  • b

    breezy-guitar-97226

    09/30/2021, 10:48 AM
    Hi here, we are currently using the add_dataset_browse_path transformation to add custom browse paths to our ingested datasets. At the same time, though, we would like to prevent the canonical ingestion path from being used, by removing it from the ingested object. We are going to achieve this with our own transformer, but I was wondering whether such a feature could also be a useful contribution, by making it an option of the current transformer via a flag (i.e.
    remove_existing_browse_paths: true
    ) Thanks!
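    A sketch of what the proposed option could look like in a recipe; note that remove_existing_browse_paths is the flag being proposed above, not an existing config, and the transformer's existing options are elided:
    Copy code
    transformers:
      - type: "add_dataset_browse_path"        # name as referenced above; check the transformer docs for the registered name
        config:
          # ... existing browse path configuration ...
          remove_existing_browse_paths: true   # proposed flag: drop the canonical ingestion path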
    w
    m
    • 3
    • 2
  • r

    red-smartphone-15526

    09/30/2021, 11:34 AM
    Hey! Working on a dbt -> DataHub ingestion. Is there any way to exclude ephemeral models from showing up in the dataset list (but still include them in lineage)?
    m
    l
    +2
    • 5
    • 5
  • a

    adorable-portugal-3397

    09/30/2021, 1:48 PM
    Hi, I'm running a simple datahub ingest -c <path to yml file>, but I get the following error. The source is a JSON file. Has anyone faced such issues? DataHub is running on k8s; locally it works just fine.
    {'error': 'Unable to emit metadata to DataHub GMS',
     'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
              'message': "All action methods (specified via 'action' in URI) must be submitted as a POST (was GET)",
              'stackTrace': 'com.linkedin.restli.server.RestLiServiceException [HTTP Status400] All action methods (specified via ' "'action' in URI)
    m
    • 2
    • 3
  • c

    chilly-nail-87894

    09/30/2021, 6:19 PM
    This polly is closed. @little-megabyte-1074 has a polly for you!
    l
    c
    +3
    • 6
    • 10
  • p

    polite-flower-25924

    10/02/2021, 8:51 AM
    Hey team, I’m very pleased that
    redshift-usage
    statistics were added with this PR in v0.8.15. This connector requires Redshift superuser privileges to query the
    svv_table_info
    and
    svl_user_info
    tables and also to look at other users’ queries. What approach do you follow to pass a Redshift superuser to this connector? I’m not sure our data platform team will allow us to use superuser credentials in a connector. I guess @witty-state-99511 can give better suggestions here 🙂
    m
    q
    • 3
    • 8
  • t

    tall-controller-60779

    10/04/2021, 11:29 AM
    Hi. We've set up DataHub with LDAP authentication. Then I launched a recipe for data ingestion. According to the logs it completed successfully, and I can even query the new objects using GraphiQL. But in the UI I don't see any objects at all. Do you have any ideas why?
    m
    b
    k
    • 4
    • 9
  • w

    witty-keyboard-20400

    10/04/2021, 1:51 PM
    Is there a way to just clean the sample data (bootstrap_mce.json) so that I could modify and ingest it cleanly? I've been using
    datahub docker nuke
    . But this removes all the containers and subsequent
    datahub docker quickstart
    results in pulling all the container images again over the network.
    b
    p
    • 3
    • 10
  • w

    witty-keyboard-20400

    10/04/2021, 2:12 PM
    When I directly execute
    datahub ingest -c  sample.yml
    which points to the latest checked-out bootstrap_mce.json, I see that only 4 records are ingested.
    Copy code
    [user@localhost datahub]$ datahub ingest -c ./metadata-ingestion/examples/mce_files/sample.yml 
    [2021-10-04 19:35:09,834] INFO     {datahub.cli.ingest_cli:57} - Starting metadata ingestion
    [2021-10-04 19:35:09,858] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:0
    [2021-10-04 19:35:09,886] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:1
    [2021-10-04 19:35:09,907] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:2
    [2021-10-04 19:35:09,964] INFO     {datahub.ingestion.run.pipeline:61} - sink wrote workunit file://./sample.json:3
    [2021-10-04 19:35:09,964] INFO     {datahub.cli.ingest_cli:59} - Finished metadata ingestion
    
    Source (file) report:
    {'failures': {},
     'warnings': {},
     'workunit_ids': ['file://./sample.json:0', 'file://./sample.json:1', 'file://./sample.json:2', 'file://./sample.json:3'],
     'workunits_produced': 4}
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 4, 'warnings': []}
    
    Pipeline finished successfully
    My sample.yml is:
    Copy code
    source:
      type: "file"
      config:
        filename: "./sample.json"
    
    # see <https://datahubproject.io/docs/metadata-ingestion/sink_docs/datahub> for complete documentation
    sink:
      type: "datahub-rest"
      config:
        server: "<http://localhost:8080>"
    However, when I execute datahub docker ingest-sample-data, I see that 82 records are ingested:
    Copy code
    ...
                      'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:79',
                      'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:80',
                      'file:///var/folders/89/c3061_k547b_g6tgh77dbxsm0000gp/T/tmpnczwmxw4.json:81'],
     'workunits_produced': 82}
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 82, 'warnings': []}
    What is the difference between manually ingesting against the bootstrap_mce.json vs the
    ingest-sample-data
    command? @mammoth-bear-12532 @big-carpet-38439
    l
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/04/2021, 3:58 PM
    In the bootstrap_mce.json, SampleKafkaDataset --> SchemaMetadata --> fields, the 1st field definition is
    Copy code
    "fields": [
                    {
                      "fieldPath": "[version=2.0].[type=boolean].field_foo_2",
                      "jsonPath": null,
                      "nullable": false,
                      "description": {
                        "string": "Foo field description"
                      },
                      "type": {
                        "type": {
                          "com.linkedin.pegasus2avro.schema.BooleanType": {}
                        }
                      },
                      "nativeDataType": "varchar(100)",
                      "globalTags": {
                        "tags": [{ "tag": "urn:li:tag:NeedsDocumentation" }]
                      },
                      "recursive": false
                    },
    ....
    ]
    I checked the type of fieldPath; it's declared simply as:
    Copy code
    fieldPath: SchemaFieldPath
    ..and SchemaFieldPath is defined as:
    Copy code
    typeref SchemaFieldPath = string
    Question: Is there any significance to including
    version
    and
    type: boolean
    in the SchemaFieldPath:
    "[version=2.0].[type=boolean].field_foo_2"
    ?
    m
    • 2
    • 1
  • n

    nice-planet-17111

    10/05/2021, 6:01 AM
    Hi, does anyone know how to define credentials in the recipe file, or how to handle permission errors, when ingesting from
    bigquery-usage
    ?
    • options.credentials_path or extra_client_options.credentials_path does not work (the run fails with "got an unexpected keyword argument" or "extra fields not permitted").
    • I tried export GOOGLE_APPLICATION_CREDENTIALS -> the recipe runs but stops with the error "the caller does not have permission".
    • BigQuery ingestion under the same environment & configs works without errors.
    m
    b
    • 3
    • 8
  • w

    witty-keyboard-20400

    10/05/2021, 8:03 AM
    Question on nativeDataType. In the file test_serde_large.json, I see
    Copy code
    "nativeDataType": "INTEGER(unsigned=True)"
    while in the glue_mces_golden.json, I see
    Copy code
    "nativeDataType": "int",
    Does the nativeDataType attribute refer to the data type natively supported by the source system, or are both formats supported by DataHub?
    g
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/05/2021, 3:32 PM
    Question on upstream lineage (UpstreamLineage): In the bootstrap_mce.json, I see that UpstreamLineage is defined at the DatasetSnapshot level.
    Copy code
    {
      "com.linkedin.pegasus2avro.dataset.UpstreamLineage": {
        "upstreams": [
          {
            "auditStamp": {
              "time": 1581407189000,
              "actor": "urn:li:corpuser:jdoe",
              "impersonator": null
            },
            "dataset": "urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)",
            "type": "TRANSFORMED"
          }
        ]
      }
    }
    Shouldn't there be a feature to track lineage at field level? @green-football-43791 @big-carpet-38439
    g
    h
    • 3
    • 13
  • r

    rough-eye-60206

    10/05/2021, 10:50 PM
    Hello, I am new to DataHub and am currently able to ingest metadata from Hive. Can someone guide me or point me to an example/documentation on how to ingest the metadata descriptions for tables/columns?
    l
    m
    • 3
    • 12