# ingestion
  • calm-river-44367

    10/27/2021, 8:54 AM
    I want to insert profiles for a dataset in the Stats section of DataHub. Is there a JSON file or another example I can use to see how to do this? I will appreciate your help.
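    For reference, a minimal sketch of emitting a DatasetProfile aspect with the Python REST emitter, assuming a GMS at http://localhost:8080 (the URN, counts and field stats below are illustrative):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        DatasetFieldProfileClass,
        DatasetProfileClass,
    )

    # Profile statistics for one table; timestampMillis is required.
    profile = DatasetProfileClass(
        timestampMillis=1635321600000,
        rowCount=4500,
        columnCount=2,
        fieldProfiles=[
            DatasetFieldProfileClass(
                fieldPath="name",
                nullCount=0,
                uniqueCount=4480,
                sampleValues=["alice", "bob"],
            )
        ],
    )

    emitter = DatahubRestEmitter("http://localhost:8080")
    emitter.emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,mydb.mytable,PROD)",
            aspectName="datasetProfile",
            aspect=profile,
        )
    )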
  • bland-teacher-17190

    10/27/2021, 1:21 PM
    Hi team, are there any more extensive examples of emitters? I am trying to get some SQL schemas into DataHub, but I can't find good examples of how to construct an MCE with schemas. All that's in examples/library are MCEs with upstream and downstream info.
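    A minimal sketch of an MCE that carries a SchemaMetadata aspect, emitted with the Python REST emitter (assumes a GMS at http://localhost:8080; the platform, dataset and fields are illustrative):
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        AuditStampClass,
        DatasetSnapshotClass,
        MetadataChangeEventClass,
        NumberTypeClass,
        OtherSchemaClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        SchemaMetadataClass,
        StringTypeClass,
    )

    audit = AuditStampClass(time=1635321600000, actor="urn:li:corpuser:ingestion")

    # Schema aspect: one numeric and one string column on a MySQL table.
    schema = SchemaMetadataClass(
        schemaName="mydb.customers",
        platform="urn:li:dataPlatform:mysql",
        version=0,
        hash="",
        platformSchema=OtherSchemaClass(rawSchema="CREATE TABLE customers (...)"),
        created=audit,
        lastModified=audit,
        fields=[
            SchemaFieldClass(
                fieldPath="id",
                type=SchemaFieldDataTypeClass(type=NumberTypeClass()),
                nativeDataType="BIGINT",
            ),
            SchemaFieldClass(
                fieldPath="email",
                type=SchemaFieldDataTypeClass(type=StringTypeClass()),
                nativeDataType="VARCHAR(255)",
                nullable=True,
            ),
        ],
    )

    mce = MetadataChangeEventClass(
        proposedSnapshot=DatasetSnapshotClass(
            urn="urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.customers,PROD)",
            aspects=[schema],
        )
    )

    DatahubRestEmitter("http://localhost:8080").emit_mce(mce)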
  • blue-holiday-20644

    10/27/2021, 1:39 PM
    Hi- has anyone managed to get lineage ingestion working with AWS Managed Airflow? MWAA configuration options seem quite limited.
  • curved-jordan-15657

    10/27/2021, 3:38 PM
    Hello team, do you have any Airflow DAG code to roll back or delete all datasets, or any other suggestions? I gather you are planning to support deleting entities via GraphQL; how is that going?
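    A hedged sketch of one option in the meantime: soft-delete entities by emitting the Status aspect with removed=True for each URN (assumes a GMS at http://localhost:8080; the URN list is illustrative):
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import ChangeTypeClass, StatusClass

    emitter = DatahubRestEmitter("http://localhost:8080")

    # Soft-delete: the entities disappear from the UI, but their rows remain in MySQL.
    urns_to_remove = [
        "urn:li:dataset:(urn:li:dataPlatform:hive,mydb.old_table,PROD)",
    ]

    for urn in urns_to_remove:
        emitter.emit_mcp(
            MetadataChangeProposalWrapper(
                entityType="dataset",
                changeType=ChangeTypeClass.UPSERT,
                entityUrn=urn,
                aspectName="status",
                aspect=StatusClass(removed=True),
            )
        )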
  • victorious-dream-46349

    10/27/2021, 4:27 PM
    Hi team, for our use case we are writing a terraform provider for DataHub (and yes, we will open-source it once it matures). Something like the below will add a dataset for BigQuery with its schema defined.
    resource "datahub_dataset" "test" {
      platform = "bigquery"
      name = "testdataset"
      origin = "DEV"
      owner = "dinesh"
      schema_name = "test"
      
      field {
        field_path = "name"
        native_datatype = "String()"
        recursive = false
        nullable = true
      }
    
      field {
        field_path = "address"
        native_datatype = "String()"
        recursive = true
        nullable = false
      }
    
      tags = ["tag1", "tag2"]
      upstreams = [
        datahub_dataset.gcs.id,
        "urn:li:dataset:(urn:li:dataPlatform:gcs,testgcs,DEV)"
      ]
    }
    Question: for this Terraform provider we are using the REST-API-based endpoints. Will that scale? Should we use the GraphQL-based APIs instead?
  • calm-morning-92759

    10/27/2021, 4:42 PM
    Hello everyone, in Google Cloud it is possible to define labels for resources. So far we have not been able to query these labels via the API and save them in DataHub. Is there a suitable solution to this problem? Any information is very welcome. Thank you very much.
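    A hedged sketch of one possible approach: read a BigQuery table's labels with the google-cloud-bigquery client and attach them to the matching DataHub dataset as tags (the GMS address, project/dataset/table names and the tag-naming scheme are illustrative):
    from google.cloud import bigquery

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GlobalTagsClass,
        TagAssociationClass,
    )

    client = bigquery.Client(project="my-project")
    table = client.get_table("my-project.my_dataset.my_table")

    # table.labels is a dict of the labels defined on the BigQuery table.
    tags = GlobalTagsClass(
        tags=[
            TagAssociationClass(tag=f"urn:li:tag:{key}_{value}")
            for key, value in (table.labels or {}).items()
        ]
    )

    DatahubRestEmitter("http://localhost:8080").emit_mcp(
        MetadataChangeProposalWrapper(
            entityType="dataset",
            changeType=ChangeTypeClass.UPSERT,
            entityUrn="urn:li:dataset:(urn:li:dataPlatform:bigquery,my-project.my_dataset.my_table,PROD)",
            aspectName="globalTags",
            aspect=tags,
        )
    )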
  • acceptable-vr-75043

    10/27/2021, 10:31 PM
    If we have multiple Snowflake instances and MySQL clusters, what's the best way to handle naming collisions? The dataset URN consists of platform (snowflake or mysql), name (db.schema.table for Snowflake, db.table for MySQL) and fabric, none of which includes the Snowflake/MySQL cluster name. (https://datahubproject.io/docs/what/urn/)
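    One possible workaround is to fold the instance or cluster name into the name part of the URN; a small illustrative sketch (the instance prefixes are made up):
    from datahub.emitter.mce_builder import make_dataset_urn

    # Prefix the dataset name with the cluster/instance it came from so that
    # URNs stay unique across multiple instances of the same platform.
    urn_a = make_dataset_urn("snowflake", "acct_us_east.db.schema.table", "PROD")
    urn_b = make_dataset_urn("mysql", "orders_cluster.db.table", "PROD")

    print(urn_a)  # urn:li:dataset:(urn:li:dataPlatform:snowflake,acct_us_east.db.schema.table,PROD)
    print(urn_b)  # urn:li:dataset:(urn:li:dataPlatform:mysql,orders_cluster.db.table,PROD)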
  • agreeable-hamburger-38305

    10/27/2021, 11:34 PM
    I am trying to get the rendered YAML with ingestion-cron enabled, but am getting this error. Not super familiar with Helm, can someone help? Thanks!!
    >> helm install datahub datahub/datahub --set-string datahub-ingestion-cron.enabled=true --dry-run
    >> dependencies.go:49: Warning: Condition path 'datahub-ingestion-cron.enabled' for chart datahub-ingestion-cron returned non-bool value
  • rhythmic-sundown-12093

    10/28/2021, 8:34 AM
    Hi, I have two tables related by foreign keys in MySQL. I imported them into DataHub, but their relationship is not shown on the page. The foreign key data is actually present in the metadata_aspect_v2 table in the datahub database. I have put the process I executed in the log.txt file, please check @loud-island-88694 @little-megabyte-1074
    log.txt
  • red-pizza-28006

    10/28/2021, 9:17 AM
    Trying to understand something: if I ingest all users from Azure AD using this recipe - https://datahubproject.io/docs/metadata-ingestion/source_docs/azure-ad - without enabling SSO, will those users be able to log in to DataHub as well?
  • damp-ambulance-34232

    10/28/2021, 11:26 AM
    How can I exclude a Kudu table in my hive.yml ingestion file?
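    A hedged sketch of excluding such tables with the source's table_pattern deny list, shown here programmatically via Pipeline.create; the same allow/deny block can go under source.config in hive.yml (the host, regex and sink below are illustrative):
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(
        {
            "source": {
                "type": "hive",
                "config": {
                    "host_port": "localhost:10000",
                    # Drop anything that looks like a Kudu-backed table.
                    "table_pattern": {"deny": [".*kudu.*"]},
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()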
  • red-pizza-28006

    10/28/2021, 2:37 PM
    Another question regarding setting up OIDC using Azure AD: it asks for a discovery URL for Azure AD, any idea what it would be? This page (https://docs.microsoft.com/en-us/azure/active-directory/develop/v2-protocols-oidc) shows the path to be
    /.well-known/openid-configuration
    but what would be the full URL?
  • square-painting-93399

    10/28/2021, 4:50 PM
    Hello all, I am a bit confused about updating an entity. My organization is currently running our DataHub instance on a k8s cluster. I initially ingested our AWS Athena data via a recipe file. We had a table schema update and I would like to update it in DataHub. When I rerun the same recipe file via the "datahub ingest -c" command, I get errors. Is this the practice I should be following? Should I be deleting the entity beforehand? I searched the documentation for a while and was not able to find it. Apologies if it's there 🙂
  • acceptable-greece-56919

    10/28/2021, 5:19 PM
    Hello everyone, does DataHub support a plugin for Presto version 347? Does the Superset plugin support authentication using OpenID? Thank you very much!
  • damp-ambulance-34232

    10/29/2021, 3:59 AM
    Got this error when ingesting from Superset. I can ingest some dashboards but not all:
    File "/usr/local/lib/python3.6/dist-packages/datahub/entrypoints.py", line 91, in main
        sys.exit(datahub(standalone_mode=False, **kwargs))
    File "/usr/lib/python3/dist-packages/click/core.py", line 722, in __call__
        return self.main(*args, **kwargs)
    File "/usr/lib/python3/dist-packages/click/core.py", line 697, in main
        rv = self.invoke(ctx)
    File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/usr/lib/python3/dist-packages/click/core.py", line 1066, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
    File "/usr/lib/python3/dist-packages/click/core.py", line 895, in invoke
        return ctx.invoke(self.callback, **ctx.params)
    File "/usr/lib/python3/dist-packages/click/core.py", line 535, in invoke
        return callback(*args, **kwargs)
    File "/usr/local/lib/python3.6/dist-packages/datahub/cli/ingest_cli.py", line 58, in run
        pipeline.run()
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/run/pipeline.py", line 125, in run
        for wu in self.source.get_workunits():
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/source/superset.py", line 339, in get_workunits
        yield from self.emit_dashboard_mces()
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/source/superset.py", line 250, in emit_dashboard_mces
        dashboard_data
    File "/usr/local/lib/python3.6/dist-packages/datahub/ingestion/source/superset.py", line 213, in construct_dashboard_from_api_data
        position_data = json.loads(raw_position_data)
    File "/usr/lib/python3.6/json/__init__.py", line 348, in loads
        'not {!r}'.format(s.__class__.__name__))
    
    TypeError: the JSON object must be str, bytes or bytearray, not 'NoneType'
  • handsome-belgium-11927

    10/29/2021, 8:19 AM
    Hello everyone! Is there any information on how to construct a field URN? I'm trying to ingest foreign keys, and the error says that the field URN is invalid; I haven't seen any reference to it before.
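    For what it's worth, a schemaField URN nests the parent dataset URN plus the field path; a small illustrative sketch (the dataset and field names are made up, and the helper is defined locally rather than taken from the library):
    def make_schema_field_urn(dataset_urn: str, field_path: str) -> str:
        # A schemaField URN wraps the parent dataset URN and the field path.
        return f"urn:li:schemaField:({dataset_urn},{field_path})"

    dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.orders,PROD)"
    print(make_schema_field_urn(dataset_urn, "customer_id"))
    # urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:mysql,mydb.orders,PROD),customer_id)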
  • tall-controller-60779

    10/29/2021, 11:32 AM
    Hello everyone. Is it possible to ingest a DataJobSnapshot via POST request? It works successfully for datasets and other entities, but with DataJobSnapshot I always receive a "union type is not backed by a DataMap or null" error.
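    A hedged sketch of what the POST payload might look like, with the snapshot and each aspect keyed by their fully-qualified class names (the usual cause of that union error); the GMS address, flow/job ids and aspect values are illustrative:
    import requests

    payload = {
        "entity": {
            "value": {
                # The snapshot union member must be keyed by its fully-qualified name.
                "com.linkedin.metadata.snapshot.DataJobSnapshot": {
                    "urn": "urn:li:dataJob:(urn:li:dataFlow:(airflow,my_dag,prod),my_task)",
                    "aspects": [
                        {
                            # Each aspect is likewise keyed by its fully-qualified name.
                            "com.linkedin.datajob.DataJobInfo": {
                                "name": "my_task",
                                "type": {"string": "SQL"},
                            }
                        }
                    ],
                }
            }
        }
    }

    resp = requests.post(
        "http://localhost:8080/entities?action=ingest",
        json=payload,
        headers={"X-RestLi-Protocol-Version": "2.0.0"},
    )
    resp.raise_for_status()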
  • nice-planet-17111

    10/29/2021, 12:51 PM
    Hi everyone, I want to know what queries are behind SQL profiling. Does anyone know where I can find the related logs, or what queries are run behind it? 🙂
  • better-orange-49102

    10/29/2021, 1:05 PM
    I noticed that inside the DatasetProfile class in schema_classes there are two attributes, eventGranularity and partitionSpec, but they don't seem to be populated when I run a data profiling job on a Postgres table. Are they actually in use, or are they "for future use"? I'm asking because I'm trying to implement my own profiler using pandas-profiling with a JDBC source.
  • damp-ambulance-34232

    11/01/2021, 8:05 AM
    Hi, when a new table shows up in my database, does DataHub automatically ingest the new table?
  • dazzling-notebook-2883

    11/01/2021, 9:50 AM
    Hello everyone, I was wondering if there is any community effort ongoing to import ontology in W3C format (Turtle, JSON-LD) to DataHub Business Glossary?
  • damp-minister-31834

    11/01/2021, 11:04 AM
    Hi all. I found that there is a "Queries" button on the dataset page, but it is always greyed out even after I run some queries. How can I get the button enabled? (I am using the Hive source as a test.)
  • red-pizza-28006

    11/01/2021, 3:54 PM
    I am looking to ingest data from Salesforce. I noticed that they have an API to describe all objects within Salesforce, and the response looks like this:
    {
      "size": 112,
      "totalSize": 112,
      "done": true,
      "queryLocator": null,
      "entityTypeName": "FieldDefinition",
      "records": [
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.Id"
          },
          "DataType": "Lookup()",
          "Description": null
        },
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.Owner"
          },
          "DataType": "Lookup(User,Group)",
          "Description": null
        },
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.IsDeleted"
          },
          "DataType": "Checkbox",
          "Description": null
        },
        {
          "attributes": {
            "type": "FieldDefinition",
            "url": "/services/data/v53.0/tooling/sobjects/FieldDefinition/MessagingSession.Name"
          },
          "DataType": "Auto Number",
          "Description": null
        }
      ]
    }
    As you can see, the DataType is not similar to what we have in other languages. Would DataHub be able to handle this if I manually ingest it by putting it in an ingestible file?
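    It likely can, since nativeDataType is free-form text and only the structured type needs a mapping you define yourself; a hedged sketch of turning such records into SchemaFields (the fallback mapping and field-path derivation are illustrative):
    from datahub.metadata.schema_classes import (
        BooleanTypeClass,
        NumberTypeClass,
        SchemaFieldClass,
        SchemaFieldDataTypeClass,
        StringTypeClass,
    )

    # Map Salesforce DataType strings onto DataHub's structured types,
    # falling back to string for anything unrecognised (e.g. "Lookup(User,Group)").
    TYPE_MAP = {
        "Checkbox": BooleanTypeClass,
        "Number": NumberTypeClass,
    }

    def to_schema_field(record: dict) -> SchemaFieldClass:
        native = record["DataType"]
        type_class = TYPE_MAP.get(native, StringTypeClass)
        # ".../FieldDefinition/MessagingSession.Id" -> "Id"
        field_path = record["attributes"]["url"].rsplit("/", 1)[-1].split(".")[-1]
        return SchemaFieldClass(
            fieldPath=field_path,
            type=SchemaFieldDataTypeClass(type=type_class()),
            nativeDataType=native,  # keep the raw Salesforce type for display
            description=record.get("Description"),
        )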
  • curved-jordan-15657

    11/01/2021, 5:03 PM
    Hi team, I have a question about how metadata is stored in MySQL. Why does DataHub keep every version of pipeline metadata instead of overwriting it? We have many DAGs, some of them running every 5 or 10 minutes, and when I checked MySQL I saw 1500 versions for a specific task created within 4 days.
  • important-camera-38424

    11/01/2021, 7:46 PM
    Hello everyone. I'm excited about the potential of DataHub. I successfully ran ingestion from Glue but find no lineage for most of our jobs. The curious thing is that under Platforms I see two boxes: Glue with 315 objects and S3 with 4. Only the ones listed under S3 give me lineage; none of the Glue jobs do. All our sources and targets are in S3. Is there a setting or config required to extract lineage from Glue jobs? Any hints would be appreciated.
  • victorious-dream-46349

    11/02/2021, 11:03 AM
    Please refer to this question related to ingestion using the REST API.
  • witty-butcher-82399

    11/02/2021, 5:31 PM
    Hi! I haven't found any mention of partition columns in the SchemaField metadata or any other schema-related PDL (https://github.com/linkedin/datahub/blob/master/metadata-models/src/main/pegasus/com/linkedin/schema/SchemaField.pdl). Is there any plan to include such info in the model?
  • damp-minister-31834

    11/03/2021, 3:26 AM
    Hi all! I want to ask about how to ingest metadata automatically. I ingest the Hive source using datahub ingest -c hive_to_rest.yml, but when my data is updated in Hive, do I need to run the command again to update the metadata in DataHub? If I want continuous updates once ingested, what should I do?
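    Batch ingestion is pull-based, so it is usually just re-run on a schedule; a hedged sketch of wrapping the same recipe in an Airflow DAG (Airflow 2.x import paths; the recipe path and interval are illustrative):
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Re-run the same recipe on a schedule so new or changed Hive tables get picked up.
    with DAG(
        dag_id="datahub_hive_ingestion",
        start_date=datetime(2021, 11, 1),
        schedule_interval=timedelta(hours=6),
        catchup=False,
    ) as dag:
        ingest = BashOperator(
            task_id="ingest_hive",
            bash_command="datahub ingest -c /opt/recipes/hive_to_rest.yml",
        )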
  • sparse-planet-56664

    11/03/2021, 8:36 AM
    Has anyone tried ingesting thoughtspot data?
  • acceptable-eye-63357

    11/03/2021, 8:38 AM
    I see iceberg table support is on the roadmap for this quarter. Any idea how it’s going and when it will be available? Also what is planned for Great Expectations?