# getting-started
  • b

    big-carpet-38439

    09/21/2021, 4:17 PM
    so close to 1300 🤩
    🔥 8
    📈 8
    🎆 4
    🚀 7
    d
    • 2
    • 2
  • m

    mammoth-bear-12532

    09/22/2021, 3:44 PM
    Folks: maybe @calm-sunset-28996 -> how are you securing frontend and gms today (from outside)? There are a few people interested in learning more. 🧵
    s
    p
    +2
    • 5
    • 10
  • b

    bland-orange-13353

    09/22/2021, 6:57 PM
    This message was deleted.
    l
    • 2
    • 1
  • b

    brief-insurance-68141

    09/22/2021, 7:00 PM
    I set up DataHub on k8s using the Helm charts at https://github.com/acryldata/datahub-helm. The service runs successfully in my cluster and the pods are healthy. How can I connect to my Hive tables? Where can I find documentation on connecting DataHub to Hive, or on ingesting my Hive metadata into DataHub?
    l
    b
    g
    • 4
    • 12
  • s

    silly-umbrella-20605

    09/22/2021, 8:34 PM
    @little-megabyte-1074 is Acryldata's offering a commercial SaaS version of the open-source DataHub?
    l
    • 2
    • 1
  • n

    nutritious-bird-77396

    09/23/2021, 3:53 PM
    How do people currently handle reporting asks? For example: give me a report of all datasets/models specific to a business domain.
    q
    l
    • 3
    • 2
  • w

    witty-keyboard-20400

    09/24/2021, 8:44 AM
    After executing the following successfully: datahub ingest -c ./examples/recipes/file_to_datahub_rest.yml ... I still don't see anything at all here: http://localhost:9002/browse/dataset
    b
    b
    • 3
    • 4
  • w

    witty-keyboard-20400

    09/27/2021, 5:13 AM
    How do we specify relationship between 2 entities in the JSON data file?
    g
    • 2
    • 1
  • b

    brief-cricket-98290

    09/27/2021, 10:31 AM
    Hi everyone, apologies for the beginner's questions, as I am just starting my data governance journey:
    • Is there a Vertica plugin for metadata ingestion?
    • If not, what would be a good starting point for making a custom ingestion plugin? This documentation page, I suppose?
    • Is it possible to ingest Kafka metadata about consumers, namely consumer groups and the list of consumers?
    Thanks!
    l
    h
    c
    • 4
    • 5
  • w

    witty-keyboard-20400

    09/27/2021, 1:53 PM
    What command could I execute to list all the services (Kafka, ES, MySQL, etc.) that DataHub is maintaining?
    b
    • 2
    • 1
  • a

    acceptable-architect-70237

    09/27/2021, 11:11 PM
    Hi DataHub team, when you implemented ES as the graph DB, did you run any performance comparison between ES as a graph DB and Neo4j? Or is the rationale for replacing Neo4j with ES mainly simplicity of the tech stack, and possibly cost (Neo4j)? I did a quick comparison of the time taken to execute GraphService for Neo4j and ES separately, and I didn't see an obvious performance gain with ES.
    l
    • 2
    • 2
  • w

    witty-keyboard-20400

    09/28/2021, 10:23 AM
    How do I read each line and interpret it logically? I've taken the top section from the examples/mce_files/bootstrap_mce.json file. E.g., what is the meaning of "urn": "urn:li:corpuser:datahub"? Considering that URNs are in the format urn:<Namespace>:<Entity Type>:<ID>, where is the entity type corpuser defined? What is the significance of the "Snapshot" suffix, given that there is no timestamp field in this entire section? { "auditHeader": null, "proposedSnapshot": { "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": { "urn": "urn:li:corpuser:datahub", "aspects": [ { "com.linkedin.pegasus2avro.identity.CorpUserInfo": { "active": true, "displayName": { "string": "Data Hub" }, "email": "datahub@linkedin.com", "title": { "string": "CEO" }, "managerUrn": null, "departmentId": null, "departmentName": null, "firstName": null, "lastName": null, "fullName": { "string": "Data Hub" }, "countryCode": null } } ] } }, "proposedDelta": null }
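One way to read an MCE like the one in this message is to walk it programmatically: the single key under proposedSnapshot names the snapshot type, its urn identifies the entity, and each element of aspects is a one-key object whose key names the aspect type. The following is a minimal sketch using only the Python standard library; the helper name summarize_mce is hypothetical and not part of DataHub, and the sample is an abbreviated copy of the JSON quoted above.

```python
import json


def summarize_mce(mce):
    """Return the snapshot type, entity URN, and aspect type names of one MCE.

    Hypothetical helper for illustration; it only inspects the JSON layout.
    """
    # "proposedSnapshot" holds exactly one key: the fully qualified snapshot type.
    snapshot_type, snapshot = next(iter(mce["proposedSnapshot"].items()))
    # Each aspect is a one-key object; the key is the fully qualified aspect type.
    aspects = [next(iter(aspect)).rsplit(".", 1)[-1] for aspect in snapshot["aspects"]]
    return {
        "snapshot_type": snapshot_type.rsplit(".", 1)[-1],
        "urn": snapshot["urn"],
        "aspects": aspects,
    }


# Abbreviated copy of the first MCE from bootstrap_mce.json as quoted above.
SAMPLE_MCE = json.loads("""
{
  "auditHeader": null,
  "proposedSnapshot": {
    "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": {
      "urn": "urn:li:corpuser:datahub",
      "aspects": [
        {"com.linkedin.pegasus2avro.identity.CorpUserInfo": {
          "active": true,
          "displayName": {"string": "Data Hub"}
        }}
      ]
    }
  },
  "proposedDelta": null
}
""")

summary = summarize_mce(SAMPLE_MCE)
print(summary)
```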
    q
    b
    m
    • 4
    • 6
  • w

    witty-keyboard-20400

    09/29/2021, 8:32 AM
    I'm making progress in wrapping my mind around the thought process behind the design of the metadata model. Traditionally, a "read-only" thing is "present" to begin with. Consider this statement from https://linkedin.github.io/rest.li/Validation-in-Rest_li#restli-validation-annotations: "For example, @ReadOnly should only be used to enforce that an optional field is not present. It should not be specified for a required field, making missing required field value valid."
    It felt awkward; in fact, I didn't understand these two sentences. It's hard for me to align my general understanding with this seemingly new way of describing ReadOnly and CreateOnly validations.
    b
    • 2
    • 1
  • w

    witty-keyboard-20400

    09/29/2021, 12:14 PM
    Is there a way to get a list of all entities currently ingested in DataHub? This command:
    Copy code
    curl --location --request GET 'http://localhost:8080/entities/urn%3Ali%3Achart%3Acustomers'
    is for a specific entity, a Chart. To get a list of all entities, I tried:
    Copy code
    curl --location --request GET 'http://localhost:8080/entities/'
    ...but it resulted in a 500 error with a message that the GET operation is not supported.
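The first curl command works because the URN is percent-encoded before it is placed in the path (":" becomes "%3A"). As a small sketch of that step only, the standard-library quote function can build the same URL; the base address is the one from this message, and the helper name entity_url is hypothetical.

```python
from urllib.parse import quote

# Assumption: the metadata service address used in the message above.
GMS_BASE = "http://localhost:8080"


def entity_url(urn, base=GMS_BASE):
    """Build the /entities/<urn> URL with the URN percent-encoded.

    safe="" makes quote() encode every reserved character, including ':',
    ',', '(' and ')', so the whole URN is treated as one path segment.
    """
    return f"{base}/entities/{quote(urn, safe='')}"


chart_url = entity_url("urn:li:chart:customers")
print(chart_url)
```

The same helper can be applied to dataset URNs, whose parentheses and commas are encoded as well.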
    b
    • 2
    • 1
  • w

    witty-keyboard-20400

    09/29/2021, 12:21 PM
    There should have been one standalone and fully complete end to end tutorial describing with actual json, how to create new entities, how to define relationship between those entities and all sorts of Aspects. Effectively a self-contained case study in the documentation.
    b
    m
    b
    • 4
    • 4
  • w

    witty-keyboard-20400

    09/29/2021, 12:42 PM
    How is the following line read out?
    urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)
    I understand the initial parts, urn: just a prefix for this sort of notation. li: namespace. dataset: entity type. Are the remaining parts
    (urn:li:dataPlatform:foo,bar,PROD)
    a notation for an Entity ID or Aspect or what?? What are the 3 arguments?
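As a rough illustration of how the parenthesised part can be read, the sketch below splits a dataset URN into three parts. The field names platform, name, and origin are borrowed from the DatasetKey aspect that appears in a later message in this channel; the function and class names are hypothetical, and this is not DataHub's own parser.

```python
from typing import NamedTuple


class DatasetUrnParts(NamedTuple):
    platform: str  # itself a URN, e.g. urn:li:dataPlatform:foo
    name: str      # dataset name on that platform
    origin: str    # environment, e.g. PROD


def parse_dataset_urn(urn):
    """Split urn:li:dataset:(<platform>,<name>,<origin>) into its parts."""
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"not a dataset URN: {urn!r}")
    body = urn[len(prefix):-1]
    # The platform URN contains no comma, so split it off first; the origin is
    # the last comma-separated part, which lets the name itself contain commas.
    platform, rest = body.split(",", 1)
    name, origin = rest.rsplit(",", 1)
    return DatasetUrnParts(platform, name, origin)


parts = parse_dataset_urn("urn:li:dataset:(urn:li:dataPlatform:foo,bar,PROD)")
print(parts)
```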
    m
    • 2
    • 2
  • w

    witty-keyboard-20400

    09/30/2021, 12:22 PM
    Consider the following aspect definition from https://datahubproject.io/docs/metadata-modeling/extending-the-metadata-model/
    Copy code
    namespace com.linkedin.metadata.key
    
    @Aspect = {
      "name": "dashboardKey",
    }
    record DashboardKey {
      @Searchable = {
        ...
      }
      dashboardTool: string
    
      dashboardId: string
    }
    The Urn representation of the Key shown above would be:
    urn:li:dashboard:(<tool>,<id>)
    Question: in the line just above, where is the third component, i.e. dashboard, declared as a type: in the Key Aspect dashboardKey, or in the entity definition itself?
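The URN shape in this message can be sketched as a function of two inputs: an entity type name and the ordered key fields. The sketch assumes the entity type name is supplied separately from the key record, which only lists dashboardTool and dashboardId; the function name build_urn and the sample values are hypothetical.

```python
def build_urn(entity_type, *key_fields):
    """Compose urn:li:<entity_type>:<id> from an entity type and key fields.

    Assumption for illustration: a single key field is used as-is, while
    several key fields are joined into a parenthesised tuple.
    """
    if not key_fields:
        raise ValueError("at least one key field is required")
    if len(key_fields) == 1:
        return f"urn:li:{entity_type}:{key_fields[0]}"
    return f"urn:li:{entity_type}:({','.join(key_fields)})"


# Hypothetical values for dashboardTool and dashboardId.
dashboard_urn = build_urn("dashboard", "looker", "42")
# A single-field key, matching the corpuser URN seen in bootstrap_mce.json.
user_urn = build_urn("corpuser", "datahub")
print(dashboard_urn, user_urn)
```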
    ✅ 1
    🙌 1
    m
    • 2
    • 1
  • w

    witty-keyboard-20400

    09/30/2021, 2:00 PM
    While a "contains" Relationship is appropriate between a Dashboard and Charts, because typically a chart is part of only 1 dashboard and a Chart doesn't stand on its own (it must belong to a Dashboard), such relationship is not valid between questions and a Test. A Question can be part of different Tests organized over time as well as might come during a practice session. Any idea what sort of relationship would be appropriate between Test and Question here?
    b
    • 2
    • 3
  • w

    witty-keyboard-20400

    10/01/2021, 11:09 AM
    In the first MCE in the bootstrap_mce.json
    Copy code
    {
        "auditHeader": null,
        "proposedSnapshot": {
          "com.linkedin.pegasus2avro.metadata.snapshot.CorpUserSnapshot": {
            "urn": "urn:li:corpuser:datahub",
            "aspects": [
              {
                "com.linkedin.pegasus2avro.identity.CorpUserInfo": {
                  "active": true,
                  "displayName": {
                    "string": "Data Hub"
                  },
                  "email": "datahub@linkedin.com",
                  "title": {
                    "string": "CEO"
                  },
                  "fullName": {
                    "string": "Data Hub"
                  }
                }
              }
            ]
          }
        },
        "proposedDelta": null
      },
    CorpUserKey (with a field username) is the Key Aspect for the entity CorpUserSnapshot (as in the definition of CorpUserSnapshot.pdl). But I don't see any username field or value in this JSON element. Could anyone help me understand this anomaly? @mammoth-bear-12532 @big-carpet-38439
    b
    • 2
    • 5
  • w

    witty-keyboard-20400

    10/01/2021, 2:02 PM
    I see
    Copy code
    "owners" : [ "urn:li:corpuser:fbar", "urn:li:corpuser:bfoo" ],
    in the file DatasetUrn.pdl . What does it mean to have 2 specific "owners" here inside the schema of DataSetUrn? That too with IDs like "fbar" and "bfoo" ?
    b
    • 2
    • 2
  • w

    witty-keyboard-20400

    10/04/2021, 12:20 PM
    Question about bootstrap_mce.json, regarding datasets appearing under the browse path "prod":
    Copy code
    "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
            "urn": "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
            "aspects": [
              {
                "com.linkedin.pegasus2avro.common.Ownership": {
                  "owners": [
                    {
                      "owner": "urn:li:corpuser:jdoe",
                      "type": "DATAOWNER",
                      "source": null
                    },
                    {
                      "owner": "urn:li:corpuser:datahub",
                      "type": "DATAOWNER",
                      "source": null
                    }
                  ],
                  "lastModified": {
                    "time": 1581407189000,
                    "actor": "urn:li:corpuser:jdoe",
                    "impersonator": null
                  }
                }
              },
              {
                "com.linkedin.pegasus2avro.dataset.UpstreamLineage": {
    For the above-mentioned Hive dataset, there is no BrowsePaths aspect of the following kind, unlike in the Kafka and HDFS DatasetSnapshots:
    Copy code
    "com.linkedin.pegasus2avro.common.BrowsePaths": {
                  "paths": ["/prod/kafka/Sample..."]
                }
    How is the path /prod/hive/SampleHiveDataset appearing on the UI? OTOH, I see several ML snapshots mentioning browse paths, but those don't appear on the UI, e.g.
    Copy code
    "com.linkedin.pegasus2avro.metadata.snapshot.MLFeatureTableSnapshot": {
            "urn": "urn:li:mlFeatureTable:(urn:li:dataPlatform:feast,test_feature_table_no_labels)",
            "aspects": [
              {
                "com.linkedin.pegasus2avro.common.BrowsePaths": {
                  "paths": ["/feast/test_feature_table_no_labels"]
                }
              },
    What are the criteria by which some are displayed under browse paths on the UI?
    g
    • 2
    • 2
  • g

    gray-barista-29387

    10/04/2021, 10:52 PM
    Hi everyone, I was wondering where I can get more information about the search capabilities and features that DataHub has. I've been reading about the different microservices that compose DataHub, but I need more specifics about the search workflow. Any related documentation would be awesome, thanks.
    l
    • 2
    • 1
  • k

    kind-dawn-17532

    10/04/2021, 11:00 PM
    Hi All! I just started playing with DataHub in the last few days and I have a few newbie questions:
    1. The search apparently is at the dataset level. Is there no direct way to search for all datasets that have a specific column?
    2. In my quickstart Docker instance, I see that the Lineage, Queries and Stats tabs are disabled. I am wondering, is it because I am not using a Neo4j backend? If not, is there a way I can push this information?
    3. Currently I am using the postgres source to ingest metadata from Greenplum (since it is based on Postgres). I would like to tweak this source to say Greenplum, with a Greenplum icon, rather than postgres, to avoid confusion for my users. What is a quick way to tweak this behavior? Modify the postgres source, or create a new Greenplum source based on the postgres source?
    Thanks for your inputs!
    g
    w
    m
    • 4
    • 12
  • b

    bland-orange-13353

    10/06/2021, 1:25 PM
    This message was deleted.
    r
    • 2
    • 1
  • c

    clean-monitor-43741

    10/07/2021, 4:24 PM
    Hi everyone! I'm starting to learn DataHub, but I'm having trouble following the quickstart due to this error on the Apple M1 chip: ERROR: no matching manifest for linux/arm64/v8 in the manifest list entries. Can someone point me in the right direction on how to run the quickstart using postgres-setup instead?
    plus1 1
    b
    a
    +3
    • 6
    • 12
  • v

    victorious-stone-56510

    10/09/2021, 5:43 PM
    Hello! Tell me, please, how can I get metadata values through graphql if I know the schema?
    g
    m
    b
    • 4
    • 7
  • f

    fast-winter-10784

    10/11/2021, 4:59 PM
    Hello all! Glad to be with you all. I am a data scientist working at the University of Kansas Center for Public Partnerships & Research (KU-CPPR). We are looking for a centralized metadata management tool for our organization and are interested in DataHub. We are reviewing your documentation, but we also have a list of questions on which we would love some clarification. Our key questions:
    1. Can DataHub be used as a metadata tool WITHOUT storing the data? Also, if we decide to later add data to an environment with DataHub, how easy would that be?
    2. Can DataHub auto-generate metadata or have a permanent link to data sources (for automated updating) WITHOUT jeopardizing data security/privacy?
    3. How easy is it to edit the metadata after creation so that we have a living tool?
    4. Is there a version control mechanism to not overwrite old comments/edits when uploading new versions of data dictionaries?
    5. Can the DataHub tool search, filter, tag, add notes, and track data provenance?
    6. What other key features of DataHub would you say make it stand out from other metadata products?
    7. What else should we be thinking about when thinking about metadata management / data governance?
    If you have a moment to respond to any of the above questions, our team would love to hear your thoughts! Thank you all very much
    m
    m
    k
    • 4
    • 4
  • a

    agreeable-hamburger-38305

    10/11/2021, 5:29 PM
    Hi all, wondering if anyone knows which features in the Q3 2021 Roadmap have already been implemented: https://github.com/linkedin/datahub/blob/master/docs/roadmap.md
    m
    l
    • 3
    • 4
  • w

    witty-keyboard-20400

    10/12/2021, 5:34 AM
    When we ingest metadata from different source systems in an organization, it's possible that a field is called "score" at the origin, while a downstream system names the same field "marks". How do we handle this in DataHub to convey to a user that both are the same field? Is there any way to capture semantics in DataHub?
    b
    • 2
    • 4
  • w

    witty-keyboard-20400

    10/12/2021, 12:38 PM
    Not able to retrieve entity details through the Metadata Service. When I try to retrieve an entity by executing:
    curl 'http://localhost:8080/entities/urn:li:dataset:(urn:li:dataPlatform:cg,kv_entity,PROD)'
    I see only this much:
    Copy code
    {
      "value": {
        "com.linkedin.metadata.snapshot.DatasetSnapshot": {
          "urn": "urn:li:dataset:(urn:li:dataPlatform:cg,kv_entity,PROD)",
          "aspects": [
            {
              "com.linkedin.metadata.key.DatasetKey": {
                "origin": "PROD",
                "name": "kv_entity",
                "platform": "urn:li:dataPlatform:cg"
              }
            }
          ]
        }
      }
    }
    However, when I created this cg dataset snapshot, I had this structure:
    Copy code
    "auditHeader": null,
      "proposedSnapshot": {
        "com.linkedin.pegasus2avro.metadata.snapshot.DatasetSnapshot": {
          "urn": "urn:li:dataset:(urn:li:dataPlatform:cg,kv_entity,PROD)",
          "aspects": [
            {
              "com.linkedin.pegasus2avro.common.BrowsePaths": {
                "paths": ["/prod/cg/kv_entity"]
              }
            },
            {
              "com.linkedin.pegasus2avro.dataset.DatasetProperties": {
                "description": "kv_entity collections",
                "tags": [],
                "customProperties": {
                  "db_cluster_setup_confluence_link": "https://<....wiki link here...>",
                  "doc_author": "abc.aaa@example.com"
                }
              }
            },
            {
              "com.linkedin.pegasus2avro.common.Ownership": {
                "owners": [
                  {
                    "owner": "urn:li:corpuser:cg_owner",
                    "type": "DATAOWNER",
                    "source": null
                  }
                ],
                "lastModified": {
                  "time": 1633345222224,
                  "actor": "urn:li:corpuser:cg_owner",
                  "impersonator": null
                }
              }
            },
            {
              "com.linkedin.pegasus2avro.common.InstitutionalMemory": {
                "elements": [
                  {
                    "url": "<https://wiki> link to cg",
                    "description": "Business Requirements for CG",
                    "createStamp": {
                      "time": 1581407189000,
                      "actor": "urn:li:corpuser:cg_owner",
                      "impersonator": null
                    }
                  }
                ]
              }
            },
            {
              "com.linkedin.pegasus2avro.schema.SchemaMetadata": {
                "schemaName": "kv_entity",
                "platform": "urn:li:dataPlatform:cg",
                "version": 0,
                "created": {
                  "time": 1581407189000,
                  "actor": "urn:li:corpuser:cg_dev_01",
                  "impersonator": null
                },
                "lastModified": {
                  "time": 1581407189000,
                  "actor": "urn:li:corpuser:cg_dev_02",
                  "impersonator": null
                },
                "deleted": null,
                "dataset": null,
                "cluster": null,
                "hash": "",
                "platformSchema": {
                  "com.linkedin.pegasus2avro.schema.KafkaSchema": {
                    "documentSchema": "{\"type\":\"record\",\"name\":\"KVEntityCodes\",\"namespace\":\"com.linkedin.dataset\",\"doc\":\"KV Entity codes\",\"fields\":[{\"name\":\"tenant_id\",\"type\":[\"number\"]},....]}"
                  }
                },
                "fields": [
                  {
                    "fieldPath": "[version=2.0].[type=int].tenant_id",
                    "jsonPath": null,
                    "nullable": false,
                    "description": {
                      "string": "Tenant Id originated from .."
                    },
                    "type": {
                      "type": {
                        "com.linkedin.pegasus2avro.schema.NumberType": {}
                      }
                    },
                    "nativeDataType": "int",
                    "globalTags": {
                      "tags": [{ "tag": "urn:li:tag:NeedsDocumentation" }]
                    },
                    "recursive": false
                  },
                  {
                    ...
                  }
                ]
                }
              }
            ]
          }
        }
      }
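To make the gap between the two documents above concrete, the aspect type names of the submitted snapshot and of the returned snapshot can be compared. This is a diagnostic sketch only: the helper aspect_names is hypothetical, and both dictionaries are abbreviated copies of the JSON in this message with the aspect bodies elided to empty objects.

```python
def aspect_names(snapshot):
    """Return the short aspect type names found in a snapshot's aspects list."""
    # Each aspect is a one-key object; keep only the last dotted component.
    return {next(iter(aspect)).rsplit(".", 1)[-1] for aspect in snapshot["aspects"]}


# Abbreviated copy of the snapshot that was submitted (aspect bodies elided).
submitted = {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:cg,kv_entity,PROD)",
    "aspects": [
        {"com.linkedin.pegasus2avro.common.BrowsePaths": {}},
        {"com.linkedin.pegasus2avro.dataset.DatasetProperties": {}},
        {"com.linkedin.pegasus2avro.common.Ownership": {}},
        {"com.linkedin.pegasus2avro.common.InstitutionalMemory": {}},
        {"com.linkedin.pegasus2avro.schema.SchemaMetadata": {}},
    ],
}

# Abbreviated copy of the snapshot returned by the curl request.
returned = {
    "urn": "urn:li:dataset:(urn:li:dataPlatform:cg,kv_entity,PROD)",
    "aspects": [{"com.linkedin.metadata.key.DatasetKey": {}}],
}

missing = sorted(aspect_names(submitted) - aspect_names(returned))
print("aspects submitted but not returned:", missing)
```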
    b
    b
    • 3
    • 9