# ingestion
  • melodic-dentist-23675

    07/27/2021, 5:27 PM
    Hello everyone! I just started using DataHub and have a couple of questions about a particular use case: I am interested in ingesting metadata for schemas stored in a schema registry, but in my case they are defined using Protobuf, not Avro. As far as I understand, there is no support for doing that with Protobuf. I'd be happy to make a PR for that, which leads me to my questions: • Is my assumption correct that Protobuf support is not available, or did I miss something? • If so, is a PR the right way to approach it, or is there already something on the roadmap for it? (I couldn't find anything about it.) Thanks in advance!
  • fresh-fish-73471

    07/28/2021, 4:44 AM
    The gms service is not starting. It starts and then exits:
    bxxxxxxx8  linkedin/datahub-gms:head  "/bin/sh -c /datahub…"  30 minutes ago  Exited (255) 7 seconds ago  datahub-gms
  • square-activity-64562

    07/28/2021, 8:26 AM
    If I add ownership using https://datahubproject.io/docs/metadata-ingestion/#simple_add_dataset_ownership, will it replace the existing owners in DataHub? Glue has hadoop as the owner for some datasets, and I would like to remove that owner during ingestion somehow.
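    A minimal sketch of the transformer block in question (the owner URN below is a hypothetical placeholder):
    transformers:
      - type: "simple_add_dataset_ownership"
        config:
          owner_urns:
            - "urn:li:corpuser:data_platform_team"  # hypothetical owner to add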
  • chilly-holiday-80781

    07/28/2021, 9:49 PM
    @here Per popular request, we’ve put together a guide on how to write your own transformers! If you’ve got custom ownerships or tags, or want to modify any metadata before it’s ingested, then this is the guide for you. Please see here for the updated transformer docs and tutorial. We’ve also posted the configs and scripts used in the tutorial here.
  • lemon-receptionist-88902

    07/29/2021, 12:16 PM
    Hi everyone, this is my first time using DataHub. I have already finished ingesting BigQuery and Postgres data into DataHub locally on my laptop; the docs were pretty useful at this stage. But I cannot find a solution for ingesting that data into DataHub running in GKE. Are there docs or a solution for ingesting data into a GKE deployment? @here
  • prehistoric-yak-75049

    07/29/2021, 8:13 PM
    Hi everyone, I am looking for a Java/Scala implementation or example of data ingestion. I looked at the Maven repository for datahub but didn’t find the event dictionary jars/classes like we have for Python in the ingestion module.
  • polite-flower-25924

    07/29/2021, 8:56 PM
    Hey team, Superset and Airflow metadata ingestion requires a username & password with the db or ldap provider. However, we log in to these tools through Okta. What should I do in order to run ingestion?
  • proud-church-91494

    07/30/2021, 12:59 AM
    Hey team, after adding the following lines to my airflow.cfg file (following this documentation: https://datahubproject.io/docs/metadata-ingestion#using-datahubs-airflow-lineage-backend-recommended):
    [lineage]
    backend = datahub_provider.lineage.datahub.DatahubLineageBackend
    datahub_kwargs = {
        "datahub_conn_id": "datahub_rest_default",
        "capture_ownership_info": true,
        "capture_tags_info": true,
        "graceful_exceptions": true }
    Airflow shows me this error:
    airflow-webserver_1  | [2021-07-30 00:47:44,154] {configuration.py:468} ERROR - No module named 'datahub_provider'
    airflow-webserver_1  | [2021-07-30 00:47:59,084] {configuration.py:468} ERROR - No module named 'datahub_provider'
    airflow-webserver_1  | [2021-07-30 00:48:13,803] {configuration.py:468} ERROR - No module named 'datahub_provider'
    Is there something I need to do before these steps, like pip installing something? I'm using Airflow 2.1.2.
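    A likely fix, assuming the datahub_provider module simply isn't installed in the Airflow environment: run pip install 'acryl-datahub[airflow]' in the same virtualenv/container as the Airflow scheduler and webserver, then restart Airflow so the lineage backend can be imported.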
  • salmon-cricket-21860

    07/30/2021, 5:39 AM
    Hi, I am having a problem ingesting LDAP users that have a "." in their user name. GMS says they are ingested and posted, but they don't come up in search and I can't find them in the Elasticsearch index (I used Kibana to look for them in ES). Some users are ingested; their user names don't include a ".". I am using DataHub 0.8.6.
    Solved: after deleting the rows in Postgres, I was able to index into ES again.
  • cool-iron-6335

    07/30/2021, 8:52 AM
    I have already checked out your demo and it's very impressive, with dashboards and charts. How can I ingest some Superset content like this? It would be great if you could give me an example of ingesting Superset.
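    A minimal sketch of a Superset ingestion recipe, assuming database-backed login (the connection details below are placeholders):
    source:
      type: "superset"
      config:
        connect_uri: "http://localhost:8088"  # hypothetical Superset host
        username: admin
        password: admin
        provider: db
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"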
  • faint-hair-91313

    07/30/2021, 12:45 PM
    Hey guys, I am ingesting a file source where I do not have ownership classes defined, as I am trying to overwrite them with transformers. But for charts and dashboards it does not work. Using:
    transformers:
      - type: "simple_add_dataset_ownership"
    ...
      - type: "simple_add_dataset_tags"
  • little-van-63930

    07/31/2021, 9:47 PM
    Perhaps a silly question, but are we truly limited to the environment options ["DEV", "EI", "PROD", "CORP"]? Is there a way to create additional, more specific environment values directly applicable to our use case?
  • mysterious-lamp-73086

    08/01/2021, 10:19 AM
    Hi, I am trying to use sql-common profiling for PostgreSQL, but I get this error: extra fields not permitted (type=value_error.extra).
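    For reference, a sketch of where the profiling block normally sits in a SQL source recipe (assuming a version of acryl-datahub that ships SQL profiling; on older versions an unrecognized profiling key is rejected with exactly this "extra fields not permitted" error):
    source:
      type: postgres
      config:
        host_port: "localhost:5432"  # hypothetical connection details
        database: mydb
        profiling:
          enabled: true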
  • most-cricket-43285

    08/02/2021, 10:36 AM
    Hello! I am trying to add a transformer to my ingestion recipe to get ownership and tags from Snowflake into DataHub. When I try it for tags like this: transformer: - type: simple_add_dataset_tags config: tag_urns: - "urn:li:tag:NeedsDocumentation" - "urn:li:tag:HeyImATag" I get this error message: 1 validation error for PipelineConfig transformer extra fields not permitted (type=value_error.extra). Hope someone is able to help :)
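    A corrected sketch, assuming the root cause is the key name (the recipe expects a top-level transformers list, plural, rather than transformer):
    transformers:
      - type: "simple_add_dataset_tags"
        config:
          tag_urns:
            - "urn:li:tag:NeedsDocumentation"
            - "urn:li:tag:HeyImATag"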
  • narrow-policeman-29290

    08/03/2021, 3:45 AM
    Trying to understand the difference between MetadataChangeProposal and MetadataChangeEvent, as both seem to have similar descriptions except for what is emitted after the change occurs:
    class MetadataChangeProposalClass(DictWrapper):
        """Kafka event for proposing a metadata change for an entity. A corresponding MetadataChangeLog is emitted when the change is accepted and committed, otherwise a FailedMetadataChangeProposal will be emitted instead."""
    
    class MetadataChangeEventClass(DictWrapper):
        """Kafka event for proposing a metadata change for an entity. A corresponding MetadataAuditEvent is emitted when the change is accepted and committed, otherwise a FailedMetadataChangeEvent will be emitted instead."""
  • adventurous-scooter-52064

    08/03/2021, 8:49 AM
    Hi, if I use datahub-kafka as my sink, and I'm using AWS MSK, do I have to add extra configs to the producer_config in order to use datahub-kafka?
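    For illustration, a sketch of a datahub-kafka sink where MSK-specific settings are passed through producer_config (the broker, registry URL, and security settings are placeholders that depend on how the MSK cluster is set up):
    sink:
      type: "datahub-kafka"
      config:
        connection:
          bootstrap: "b-1.example.kafka.us-east-1.amazonaws.com:9094"  # hypothetical MSK broker
          schema_registry_url: "http://schema-registry:8081"  # hypothetical schema registry
          producer_config:
            security.protocol: SSL  # assumption: TLS-enabled MSK listener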
  • colossal-furniture-76714

    08/03/2021, 10:37 AM
    I get the following error:
    "No root resource defined for path '/entities'","status":404}(base)
    if I try to ingest via curl. The other option, via the acryl add-on, works...
    curl 'http://localhost:8080/entities?action=ingest' -X POST --data '{
       "entity":{
          "value":{
             "com.linkedin.metadata.snapshot.DatasetSnapshot":{
                "aspects":[
                   {
                      "com.linkedin.common.Ownership":{
                         "owners":[
                            {
                               "owner":"urn:li:corpuser:fbar",
                               "type":"DATAOWNER"
                            }
                         ],
                         "lastModified":{
                            "time":0,
                            "actor":"urn:li:corpuser:fbar"
                         }
                      }
                   },
                   {
                      "com.linkedin.common.InstitutionalMemory":{
                         "elements":[
                            {
                               "url":"<https://www.linkedin.com>",
                               "description":"Sample doc",
                               "createStamp":{
                                  "time":0,
                                  "actor":"urn:li:corpuser:fbar"
                               }
                            }
                         ]
                      }
                   },
                   {
                      "com.linkedin.schema.SchemaMetadata":{
                         "schemaName":"FooEvent",
                         "platform":"urn:li:dataPlatform:foo",
                         "version":0,
                         "created":{
                            "time":0,
                            "actor":"urn:li:corpuser:fbar"
                         },
                         "lastModified":{
                            "time":0,
                            "actor":"urn:li:corpuser:fbar"
                         },
                         "hash":"",
                         "platformSchema":{
                            "com.linkedin.schema.KafkaSchema":{
                               "documentSchema":"{\"type\":\"record\",\"name\":\"MetadataChangeEvent\",\"namespace\":\"com.linkedin.mxe\",\"doc\":\"Kafka event for proposing a metadata change for an entity.\",\"fields\":[{\"name\":\"auditHeader\",\"type\":{\"type\":\"record\",\"name\":\"KafkaAuditHeader\",\"namespace\":\"com.linkedin.avro2pegasus.events\",\"doc\":\"Header\"}}]}"
                            }
                         },
                         "fields":[
                            {
                               "fieldPath":"foo",
                               "description":"Bar",
                               "nativeDataType":"string",
                               "type":{
                                  "type":{
                                     "com.linkedin.schema.StringType":{
    
                                     }
                                  }
                               }
                            }
                         ]
                      }
                   }
                ],
                "urn":"urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)"
             }
          }
       }
    }'
  • adventurous-scooter-52064

    08/04/2021, 6:49 AM
    Using the data-ingestion v0.8.7 Docker image with glue as the source shows the following error:
    self = <datahub.ingestion.api.registry.Registry object at 0x7f10a1cdef70>
         key = 'glue'
         Type = typing.Type
         T = ~T
         self._mapping = {'athena': <class 'datahub.ingestion.source.sql.athena.AthenaSource'>,
                          'bigquery': <class 'datahub.ingestion.source.sql.bigquery.BigQuerySource'>,
                          'bigquery-usage': <class 'datahub.ingestion.source.usage.bigquery_usage.BigQueryUsageSource'>,
                          'dbt': <class 'datahub.ingestion.source.dbt.DBTSource'>,
                          'druid': <class 'datahub.ingestion.source.sql.druid.DruidSource'>,
                          'feast': <class 'datahub.ingestion.source.feast.FeastSource'>,
                          'file': <class 'datahub.ingestion.source.file.GenericFileSource'>,
                          'glue': ModuleNotFoundError("No module named 'mypy_boto3_glue'"),
                          'hive': <class 'datahub.ingestion.source.sql.hive.HiveSource'>,
                          'kafka': <class 'datahub.ingestion.source.kafka.KafkaSource'>,
                          'kafka-connect': <class 'datahub.ingestion.source.kafka_connect.KafkaConnectSource'>,
                          'ldap': <class 'datahub.ingestion.source.ldap.LDAPSource'>,
                          'looker': <class 'datahub.ingestion.source.looker.LookerDashboardSource'>,
                          'lookml': <class 'datahub.ingestion.source.lookml.LookMLSource'>,
                          'mongodb': <class 'datahub.ingestion.source.mongodb.MongoDBSource'>,
                          'mssql': <class 'datahub.ingestion.source.sql.mssql.SQLServerSource'>,
                          'mysql': <class 'datahub.ingestion.source.sql.mysql.MySQLSource'>,...
         tp = ModuleNotFoundError("No module named 'mypy_boto3_glue'")
         ConfigurationError = <class 'datahub.configuration.common.ConfigurationError'>
    .
    .
    .
    
    ConfigurationError: glue is disabled; try running: pip install 'acryl-datahub[glue]'
  • faint-hair-91313

    08/04/2021, 2:11 PM
    Hi guys, great job with the latest release. I do have a different behavior with lineage between charts and datasets. I see the sources here, but not on the lineage graph.
  • faint-hair-91313

    08/04/2021, 3:40 PM
    Hi guys, I am running this profiling on Oracle. Tables work neatly, but views don't show up. I do get the extra properties where we can see the view's SQL.
  • fast-leather-13054

    08/05/2021, 1:28 PM
    Hi guys, how can I invalidate all previously imported instances in DataHub before importing into a new clean install?
  • adventurous-scooter-52064

    08/05/2021, 4:12 PM
    Hi, I'm writing my own custom transformer and I'm wondering how I can return a list of table properties from what I ingested from the source (glue) through a custom config. Is there any class I can look through to return each dataset's properties?
  • narrow-kitchen-1309

    08/05/2021, 5:03 PM
    Hi guys, I am new to DataHub. I was able to ingest my own datasets, but I would like to add a customized business glossary for my schema. I keep getting an error that the parameter "snapshot" is required. Please let me know how to ingest a business glossary and add it to metadata assets.
  • future-waitress-970

    08/05/2021, 5:56 PM
    Hey everyone, I am trying to ingest the following json file, but when I do, the frontend crashes and I get the following error:
    Caused by: java.net.URISyntaxException: Urn doesn't start with 'urn:'. Urn: at index 0:
     
    at com.linkedin.common.urn.Urn.<init>(Urn.java:80)
    at com.linkedin.common.urn.Urn.createFromString(Urn.java:231)
    at com.linkedin.common.urn.DataPlatformUrn.createFromString(DataPlatformUrn.java:26)
    at com.linkedin.common.urn.DataPlatformUrn$1.coerceOutput(DataPlatformUrn.java:60)
    I already tried nuking DataHub several times, fixing things within the file, pulling the latest from GitHub, etc. Anyone got any tips?
    test.json
  • bland-easter-53873

    08/06/2021, 2:36 PM
    Is there any pattern to follow, given that I have access to only one schema in the database?
  • future-waitress-970

    08/06/2021, 4:21 PM
    To reiterate the problem: after a successful sink with the following output:
    Sink (datahub-rest) report:
    {'failures': [], 'records_written': 1, 'warnings': []}
    Pipeline finished successfully
    For the json file attached, when I go to the GUI it crashes, giving me the following error once I dig through the logs:
    Caused by: java.net.URISyntaxException: Urn doesn't start with 'urn:'. Urn: at index 0:
     
    at com.linkedin.common.urn.Urn.<init>(Urn.java:80)
    at com.linkedin.common.urn.Urn.createFromString(Urn.java:231)
    at com.linkedin.common.urn.DataPlatformUrn.createFromString(DataPlatformUrn.java:26)
    at com.linkedin.common.urn.DataPlatformUrn$1.coerceOutput(DataPlatformUrn.java:60)
    test.json
  • witty-butcher-82399

    08/09/2021, 2:50 PM
    The Redshift connector (just as an example) sets dataPlatform to redshift. This is noted here and here. Since I want to ingest tables from multiple Redshift clusters, I would like to differentiate them by having different values for the dataPlatform. I have thought of changing this with a custom transform, but since dataPlatform is part of the URN, a custom transform wouldn't work, so this would have to be managed by the connector itself; please correct me if I'm wrong. Actually, the model is what prevents this: the current approach seems to treat dataPlatform as a sort of platform categorization. Are there any plans to model dataPlatform as platform instances instead?
  • curved-jordan-15657

    08/10/2021, 11:06 AM
    Hello. I’ve deployed DataHub in our company’s k8s cluster and it works in a private network. I’ve ingested our Athena schemas and tables using the DataHub CLI with the “datahub ingest -c athena.yml” command and the datahub-rest sink. Now I want to use the rollback method, but even when I get the runId with “datahub ingest list-runs” and use that runId with “rollback”, it says:
    No entities touched by this run. Double check your run id?
    rolling back deletes the entities created by a run and reverts the updated aspects
    this rollback deleted 0 entities and rolled back 0 aspects
    showing first 0 of 0 aspects reverted by this run
    +-------+---------------+--------------+
    | urn   | aspect name   | created at   |
    +=======+===============+==============+
    +-------+---------------+--------------+
    I know the runId is correct because I’ve used it with the “show” method and clearly saw all the tables I ingested (71 tables). How do I resolve this issue? Thanks in advance!
  • handsome-football-66174

    08/10/2021, 8:02 PM
    Trying to use this transformer option: transformers: - type: "mark_dataset_status" config: removed: true. Is there any other configuration to be done?
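    The same transformer block laid out as it would appear in a recipe (a sketch; the recipe's existing source and sink sections stay unchanged):
    transformers:
      - type: "mark_dataset_status"
        config:
          removed: true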
  • magnificent-camera-71872

    08/11/2021, 5:10 AM
    Hi all - is there any python documentation for the APIs available in datahub?